- 06 Feb, 2025 1 commit
-
-
Michael Yang authored
* chore: update gitattributes
* chore: add build info source
-
- 05 Feb, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 31 Jan, 2025 1 commit
-
-
Michael Yang authored
This reverts commit bea1f1fa.
-
- 30 Jan, 2025 1 commit
-
-
Michael Yang authored
-
- 29 Jan, 2025 1 commit
-
-
Michael Yang authored
* add build to .dockerignore
* test: only build one arch
* add build to .gitignore
* fix ccache path
* filter amdgpu targets
* only filter if autodetecting
* Don't clobber gpu list for default runner. This ensures the GPU specific environment variables are set properly.
* explicitly set CXX compiler for HIP
* Update build_windows.ps1. This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset.
* build: add ollama subdir
* add .git to .dockerignore
* docs: update development.md
* update build_darwin.sh
* remove unused scripts
* llm: add cwd and build/lib/ollama to library paths
* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
* add additional cmake output vars for msvc
* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
* remove unnecessary filepath.Dir, cleanup
* add hardware-specific directory to path
* use absolute server path
* build: linux arm
* cmake install targets
* remove unused files
* ml: visit each library path once
* build: skip cpu variants on arm
* build: install cpu targets
* build: fix workflow
* shorter names
* fix rocblas install
* docs: clean up development.md
* consistent build dir removal in development.md
* silence -Wimplicit-function-declaration build warnings in ggml-cpu
* update readme
* update development readme
* llm: update library lookup logic now that there is one runner (#8587)
* tweak development.md
* update docs
* add windows cuda/rocm tests

---------

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
-
- 14 Jan, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 08 Jan, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 04 Jan, 2025 1 commit
-
-
Ubaldo Porcheddu authored
-
- 19 Dec, 2024 1 commit
-
-
Parth Sareen authored
This change adds a test to catch a regression in schema_to_grammar where the order of keys in the JSON schema is not preserved in the generated grammar, which is critical for step-by-step reasoning.
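The ordering concern here is mechanical: Go maps do not preserve key order, so any test of this behavior has to walk the raw schema with a streaming decoder to recover the original property order. Below is a minimal, self-contained sketch of that walk; `propertyOrder` is a hypothetical helper for illustration, not the function used in the actual test.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// propertyOrder returns the top-level keys of a JSON object in the order they
// appear in the source text. encoding/json maps cannot do this, which is why
// ordering regressions like the one above are easy to introduce.
func propertyOrder(raw []byte) ([]string, error) {
	dec := json.NewDecoder(bytes.NewReader(raw))
	if _, err := dec.Token(); err != nil { // consume the opening '{'
		return nil, err
	}
	var keys []string
	for dec.More() {
		tok, err := dec.Token() // the next property name
		if err != nil {
			return nil, err
		}
		keys = append(keys, tok.(string))
		var v json.RawMessage // consume and discard the property's value
		if err := dec.Decode(&v); err != nil {
			return nil, err
		}
	}
	return keys, nil
}

func main() {
	schema := []byte(`{"steps": {"type": "array"}, "answer": {"type": "string"}}`)
	keys, err := propertyOrder(schema)
	if err != nil {
		panic(err)
	}
	// A regression test can assert this order survives grammar generation:
	// "steps" must stay ahead of "answer" for step-by-step reasoning to work.
	fmt.Println(keys) // [steps answer]
}
```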
-
- 17 Dec, 2024 1 commit
-
-
Jesse Gross authored
Sometimes the KV cache requires defragmentation even without triggering the threshold heuristic. In this case, decoding will not be able to find a KV cache slot. This is particularly difficult for the caller to handle if it happens in between ubatches. To avoid this, we should immediately trigger a defrag. In addition, a heavily fragmented cache can require more than max_moves to defragment. Currently, we stop when we hit the limit, but this can leave a cache that still does not have adequate space even after defragmentation is triggered. Instead, we should do multiple batches of processing until everything is complete. Fixes #7949
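A rough sketch of the retry flow described above, using hypothetical stand-in types for the KV cache and decode call (the real logic lives in the Go runner and the llama.cpp bindings): on a "no KV slot" failure, defragment in bounded passes of maxMoves until the cache reports it is done, then retry the decode.

```go
package main

import (
	"errors"
	"fmt"
)

var errNoKVSlot = errors.New("could not find a KV slot for the batch")

// kvCache is a hypothetical stand-in for the real KV cache bindings: it just
// counts how many cells still need to be moved to close fragmentation gaps.
type kvCache struct{ fragmented int }

// defragPass performs one bounded pass (at most maxMoves moves) and reports
// whether the cache is fully defragmented afterwards.
func (c *kvCache) defragPass(maxMoves int) bool {
	moved := min(c.fragmented, maxMoves)
	c.fragmented -= moved
	return c.fragmented == 0
}

// decode stands in for the backend decode call: it fails whenever
// fragmentation still prevents the batch from finding a contiguous slot.
func (c *kvCache) decode(batch string) error {
	if c.fragmented > 0 {
		return errNoKVSlot
	}
	fmt.Println("decoded:", batch)
	return nil
}

// decodeWithDefrag retries a failed decode after defragmenting, running as
// many bounded defrag passes as needed rather than stopping at the first one.
func decodeWithDefrag(c *kvCache, batch string, maxMoves int) error {
	for {
		err := c.decode(batch)
		if !errors.Is(err, errNoKVSlot) {
			return err // success, or an error defrag cannot help with
		}
		for !c.defragPass(maxMoves) {
			// heavily fragmented caches need more than one pass of maxMoves
		}
	}
}

func main() {
	c := &kvCache{fragmented: 250}
	if err := decodeWithDefrag(c, "ubatch 0", 100); err != nil {
		panic(err)
	}
}
```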
-
- 14 Dec, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 13 Dec, 2024 1 commit
-
-
Daniel Hiltgen authored
This puts the low-level runner logging back on stderr for consistency with prior releases
-
- 12 Dec, 2024 2 commits
-
-
Pascal Patry authored
-
Parth Sareen authored
-
- 11 Dec, 2024 2 commits
-
-
Blake Mizerany authored
Previously we decoded and re-encoded JSON schemas during validation, which served no purpose since json.RawMessage already validates JSON syntax. Worse, the re-encoding lost field ordering from the original schema, which affects inference quality during step-by-step reasoning. While fixing this ordering issue by using json.RawMessage directly, testing revealed that schema_to_grammar (from llama.cpp) also fails to preserve field order during grammar generation. This appears to be the root cause of inference degradation. This change prevents us from mangling the user's original schema order, but we still need to address the ordering issue in schema_to_grammar. That will be a separate change. Updates #7978
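For illustration, here is a small self-contained Go example of the difference: holding the schema as `json.RawMessage` validates syntax during unmarshal without touching key order, while decoding into a map and re-encoding (the old behavior) sorts keys alphabetically. The `chatRequest` type is a simplified stand-in for the real request struct.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// chatRequest is a simplified stand-in for the real request type: keeping the
// schema as json.RawMessage validates syntax on Unmarshal without ever
// re-encoding it, so the user's key order passes through untouched.
type chatRequest struct {
	Format json.RawMessage `json:"format"`
}

func main() {
	in := []byte(`{"format":{"reasoning":{"type":"string"},"answer":{"type":"string"}}}`)

	var req chatRequest
	if err := json.Unmarshal(in, &req); err != nil {
		panic(err)
	}
	fmt.Println(string(req.Format)) // key order preserved: "reasoning" before "answer"

	// The old decode/re-encode path loses that order: Go maps are unordered
	// and json.Marshal writes map keys alphabetically.
	var m map[string]any
	if err := json.Unmarshal(req.Format, &m); err != nil {
		panic(err)
	}
	out, _ := json.Marshal(m)
	fmt.Println(string(out)) // "answer" now comes first
}
```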
-
Jeffrey Morgan authored
-
- 10 Dec, 2024 2 commits
-
-
Daniel Hiltgen authored
The final implementation of #7499 removed dynamic vector requirements in favor of a simpler filename based model, and this was left over logic that is no longer needed.
-
Daniel Hiltgen authored
* llama: wire up builtin runner

  This adds a new entrypoint into the ollama CLI to run the cgo built runner. On Mac arm64, this will have GPU support, but on all other platforms it will be the lowest common denominator CPU build. After we fully transition to the new Go runners more tech-debt can be removed and we can stop building the "default" runner via make and rely on the builtin always.

* build: Make target improvements

  Add a few new targets and help for building locally. This also adjusts the runner lookup to favor local builds, then runners relative to the executable, and finally payloads.

* Support customized CPU flags for runners

  This implements a simplified custom CPU flags pattern for the runners. When built without overrides, the runner name contains the vector flag we check for (AVX) to ensure we don't try to run on unsupported systems and crash. If the user builds a customized set, we omit the naming scheme and don't check for compatibility. This avoids checking requirements at runtime, so that logic has been removed as well. This can be used to build GPU runners with no vector flags, or CPU/GPU runners with additional flags (e.g. AVX512) enabled.

* Use relative paths

  If the user checks out the repo in a path that contains spaces, make gets really confused, so use relative paths for everything in-repo to avoid breakage.

* Remove payloads from main binary

* install: clean up prior libraries

  This removes support for v0.3.6 and older versions (before the tar bundle) and ensures we clean up prior libraries before extracting the bundle(s). Without this change, runners and dependent libraries could leak when we update and lead to subtle runtime errors.
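The lookup order mentioned under "Make target improvements" (local builds first, then libraries relative to the executable, then payloads) could be sketched roughly like the following; `locateRunner` and the candidate paths are illustrative assumptions, not the actual implementation.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// locateRunner is an illustrative sketch of the lookup order, not the real
// code: prefer a local development build, then libraries shipped next to the
// installed binary, and finally extracted payloads; return the first
// directory that exists.
func locateRunner(devBuildDir, payloadDir string) (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	candidates := []string{
		devBuildDir, // local make/cmake output in the source tree
		filepath.Join(filepath.Dir(exe), "lib", "ollama"), // relative to the executable
		payloadDir, // extracted payload bundle as a last resort
	}
	for _, dir := range candidates {
		if info, err := os.Stat(dir); err == nil && info.IsDir() {
			return dir, nil
		}
	}
	return "", fmt.Errorf("no runner libraries found in %v", candidates)
}

func main() {
	dir, err := locateRunner("build/lib/ollama", filepath.Join(os.TempDir(), "ollama-payloads"))
	fmt.Println(dir, err)
}
```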
-
- 05 Dec, 2024 1 commit
-
-
Parth Sareen authored
Adds structured outputs to chat endpoint

---------

Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Hieu Nguyen <hieunguyen1053@outlook.com>
-
- 03 Dec, 2024 1 commit
-
-
Sam authored
-
- 29 Nov, 2024 1 commit
-
-
Jeffrey Morgan authored
-
- 27 Nov, 2024 1 commit
-
-
ItzCrazyKns authored
Closes #7627
-
- 26 Nov, 2024 2 commits
-
-
Jesse Gross authored
When processing a prompt, we look for image tags of the form [img-0], which are inserted by the Ollama server process. However, this can cause errors if the original prompt has these tags - typically an "image not found" error is returned. This changes tag searching behavior to be similar to the 0.3.x series, which will largely avoid these problems. However, they can still happen when input text with these tags is used with image models. The correct solution is to escape the tags, but this is a larger issue with special sequences in general, so this is an incremental fix that should avoid the problem for the majority of cases.
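As a rough illustration of the 0.3.x-style matching (not the actual runner code), a tag like `[img-0]` is only treated as an image reference when its index corresponds to an attached image; anything else is kept as ordinary text rather than producing an "image not found" error.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// imageTag matches the [img-N] markers that the Ollama server inserts into
// the prompt for attached images.
var imageTag = regexp.MustCompile(`\[img-(\d+)\]`)

// splitImageTags is a rough sketch of the matching behavior, not the runner's
// actual code: a tag is only treated as an image reference when its index
// maps to an attached image; otherwise the text is kept verbatim.
func splitImageTags(prompt string, numImages int) []string {
	var parts []string
	last := 0
	for _, m := range imageTag.FindAllStringSubmatchIndex(prompt, -1) {
		idx, _ := strconv.Atoi(prompt[m[2]:m[3]])
		if idx >= numImages {
			continue // user text that merely looks like a tag: leave it alone
		}
		parts = append(parts, prompt[last:m[0]], fmt.Sprintf("<image %d>", idx))
		last = m[1]
	}
	return append(parts, prompt[last:])
}

func main() {
	// [img-0] refers to a real attachment; [img-7] does not, so it stays as text.
	fmt.Println(splitImageTags("describe [img-0] and ignore [img-7]", 1))
}
```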
-
Jesse Gross authored
This also makes it easier to truncate long inputs the same way shifting does, although it does not actually implement that yet. This type of truncation has a trade-off between quality and time to first token.
-
- 23 Nov, 2024 1 commit
-
-
Jesse Gross authored
If there are no available slots for new sequences then a request will not be added to the processing queue but will continue on to wait for a response that never comes. Besides never giving a response to the request, this prevents the model from being unloaded due to the outstanding request. To prevent this, there are semaphores that prevent more requests from being processed than there are slots - one in the Ollama server and one in the runner.
- The Ollama server one works but it is not designed to protect the runner's internal data structures and the runner can return a final response before clearing its data structures.
- The internal runner semaphore has similar behavior where it can release the semaphore when it issues a response. This is wrong - it should only release the semaphore after it has cleared the data structure.
In addition, we should return an error if a slot is not found rather than deadlocking in the event we ever get to this spot. Fixes #7779
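A compressed sketch of the corrected ordering, with hypothetical types (a buffered channel standing in for the semaphore): the slot's internal state is cleared first, the semaphore is released second, and a missing slot becomes an error instead of a silent deadlock.

```go
package main

import (
	"errors"
	"fmt"
)

// runner is a hypothetical stand-in for the runner's sequence bookkeeping: a
// buffered channel plays the role of the semaphore, sized to the slot count.
type runner struct {
	slots     []bool        // true = occupied by a sequence
	slotsFree chan struct{} // capacity == len(slots)
}

func newRunner(n int) *runner {
	r := &runner{slots: make([]bool, n), slotsFree: make(chan struct{}, n)}
	for i := 0; i < n; i++ {
		r.slotsFree <- struct{}{}
	}
	return r
}

func (r *runner) handle(seq int) error {
	<-r.slotsFree // acquire: blocks while every slot is busy
	slot := -1
	for i, used := range r.slots {
		if !used {
			slot, r.slots[i] = i, true
			break
		}
	}
	if slot < 0 {
		r.slotsFree <- struct{}{}
		return errors.New("no free slot despite acquired semaphore") // fail, don't deadlock
	}
	fmt.Printf("sequence %d running in slot %d\n", seq, slot)
	// ... generate and send the final response here ...
	r.slots[slot] = false     // 1) clear the internal data structure first
	r.slotsFree <- struct{}{} // 2) only then release the semaphore
	return nil
}

func main() {
	r := newRunner(2)
	for seq := 0; seq < 3; seq++ {
		if err := r.handle(seq); err != nil {
			fmt.Println("error:", err)
		}
	}
}
```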
-
- 22 Nov, 2024 1 commit
-
-
Daniel Hiltgen authored
Users get confused by "Failed to acquire semaphore" error="context canceled" messages in the logs, which are actually clients giving up. While there could be a legitimate hang bug in the system, sometimes this is just short client timeouts with an overloaded system, so this should help users understand what's going on better.
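A small sketch of how the log message could distinguish the two cases, assuming a `golang.org/x/sync/semaphore` guard (whose `Acquire` returns the context's error when the caller gives up); the exact wording and logger used here are illustrative.

```go
package main

import (
	"context"
	"errors"
	"log/slog"
	"time"

	"golang.org/x/sync/semaphore"
)

// acquireSlot sketches the logging distinction: when Acquire fails because
// the request's context was canceled, that usually means the client gave up,
// so the log should say so instead of only reporting a semaphore failure.
func acquireSlot(ctx context.Context, sem *semaphore.Weighted) bool {
	if err := sem.Acquire(ctx, 1); err != nil {
		if errors.Is(err, context.Canceled) {
			slog.Info("client gave up waiting for a slot (request canceled)", "error", err)
		} else {
			slog.Error("failed to acquire semaphore", "error", err)
		}
		return false
	}
	return true
}

func main() {
	sem := semaphore.NewWeighted(1)
	_ = sem.Acquire(context.Background(), 1) // the only slot is already busy

	ctx, cancel := context.WithCancel(context.Background())
	go func() { // simulate an impatient client disconnecting
		time.Sleep(50 * time.Millisecond)
		cancel()
	}()
	acquireSlot(ctx, sem) // logged as the client giving up, not as a system fault
}
```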
-
- 21 Nov, 2024 1 commit
-
-
boessu authored
-
- 20 Nov, 2024 5 commits
-
-
Jesse Gross authored
Previous versions of the runner would truncate inputs to the context window before beginning processing. The main processing loop relied on this behavior if the context needed to be shifted later (due to token generation). If truncation did not occur then invariants would be broken, causing crashes or infinite loops. Later versions attempted to fix these bugs and make the logic less subtle so that all inputs could be handled. Truncation was removed to make things consistent. However, truncation is much faster than processing and shifting, so removing it caused performance problems when the input vastly exceeded the context size. This restores the input truncation as a performance optimization while keeping the more robust processing logic. Fixes #7762
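A minimal sketch of that kind of up-front truncation, with illustrative names and split policy (keep the first numKeep tokens plus the most recent ones), rather than the runner's actual code:

```go
package main

import "fmt"

// truncateInputs sketches the optimization: keep the first numKeep tokens
// plus the most recent tokens so the total fits in the context window,
// instead of processing everything and immediately shifting most of it out.
// Names and the exact split policy are illustrative assumptions.
func truncateInputs(tokens []int, numCtx, numKeep int) []int {
	if len(tokens) <= numCtx {
		return tokens
	}
	out := append([]int{}, tokens[:numKeep]...)
	return append(out, tokens[len(tokens)-(numCtx-numKeep):]...)
}

func main() {
	tokens := make([]int, 10000)
	for i := range tokens {
		tokens[i] = i
	}
	fmt.Println(len(truncateInputs(tokens, 2048, 16))) // 2048
}
```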
-
Jesse Gross authored
We need to track which tokens are in the cache ourselves. We currently add tokens to the cache tracker when we add them to batch but they are not actually in the cache until we call Decode. This can cause confusion when we are shifting the cache. Avoids "could not find a KV slot for the batch" issues. Bug #7545
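A hypothetical sketch of the bookkeeping split: tokens are staged with the batch and only committed to the cache tracker after Decode succeeds, so cache shifting never reasons about tokens that were never placed in a KV slot. The types and method names are stand-ins.

```go
package main

import "fmt"

// slot is a hypothetical sketch of the bookkeeping: pending tokens ride along
// with the batch and are appended to the cache tracker only once Decode has
// actually succeeded.
type slot struct {
	cached  []int // tokens confirmed to be in the KV cache
	pending []int // tokens queued in the current batch
}

func (s *slot) addToBatch(tok int) { s.pending = append(s.pending, tok) }

func (s *slot) decode() error {
	// ... call into the backend here; on failure, pending tokens are NOT cached ...
	s.cached = append(s.cached, s.pending...) // commit only after a successful Decode
	s.pending = s.pending[:0]
	return nil
}

func main() {
	s := &slot{}
	s.addToBatch(101)
	s.addToBatch(102)
	fmt.Println(len(s.cached)) // 0: nothing counts as cached until Decode runs
	_ = s.decode()
	fmt.Println(len(s.cached)) // 2
}
```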
-
Jesse Gross authored
We try to recover from errors by dropping the tokens that caused the problem and re-trying. However, dropping the tokens is not correct, and continuing often leads to infinite loops. To avoid this, we end the sequence if such a condition is detected, which is also surprising. At this point, it is better to just report the error. This will make it easier to find problems, and the alternatives are perhaps even more surprising to users. This is not a very satisfactory solution either - we should isolate the error and return it to the user without killing the whole process. However, this is an incremental step and consistent with most other failures (which either manifest as abort() or panic).
-
Jesse Gross authored
Fragmentation of the KV cache can occur due to cache shifting or different sequences getting processed. Decode uses a heuristic to decide if it should defrag. However, this heuristic isn't 100% accurate, so decoding can sometimes fail by surprise. For these cases, if decode indicates that there is no KV cache space, we should defrag and then try again.
-
Jesse Gross authored
This doesn't have any impact currently because NUM_PARALLEL is forced to 1 for embeddings, so both indices will always be 0.
-
- 19 Nov, 2024 1 commit
-
-
Gabe Goodhart authored
https://github.com/ollama/ollama/issues/7656

Branch: Granite3StoppingBug-7656
Signed-off-by: Gabe Goodhart <ghart@us.ibm.com>
-
- 15 Nov, 2024 2 commits
-
-
Jesse Gross authored
This is a partial revert of 8a35bb92 "runner.go: Increase survivability of main processing loop", removing the panic handler. Although we want to avoid errors taking down the runner, we also should make the user aware of problems when they happen. In the future, we can restructure things so both parts are true.
-
Jesse Gross authored
Currently, if an error occurs during the prep stages (such as tokenizing) of a single request, it will only affect that request. However, if an error happens during decoding, it can take down the entire runner. Instead, it's better to drop the tokens that triggered the error and try to keep going. However, we also need to stop when we run out of tokens; otherwise, this just causes an infinite loop. This is likely the cause of at least some of the hanging issues that have been reported. Bug #7573
-
- 14 Nov, 2024 3 commits
-
-
Jesse Gross authored
It's possible to get prompts that consist entirely of whitespace - this is most likely to happen when generating embeddings. Currently, we will trim this away, leaving an empty prompt, which will then generate an error. Generating embeddings from whitespace should not trigger an error, as this may break pipelines. It's better to just leave the whitespace in place and process what we are given. This is consistent with past versions of Ollama. Bug #7578
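One way such a guard could look (a sketch under assumed names, not the actual change): trim only when something non-whitespace remains, so a whitespace-only embedding input passes through as given instead of becoming an empty prompt that errors out.

```go
package main

import (
	"fmt"
	"strings"
)

// normalizePrompt sketches the guard: trim only when something non-whitespace
// remains, so a whitespace-only embedding input is processed as-is rather
// than becoming an empty prompt that triggers an error. Illustrative only.
func normalizePrompt(s string) string {
	if trimmed := strings.TrimSpace(s); trimmed != "" {
		return trimmed
	}
	return s // all whitespace: embed what we were given
}

func main() {
	fmt.Printf("%q\n", normalizePrompt("  hello  ")) // "hello"
	fmt.Printf("%q\n", normalizePrompt("   "))       // "   "
}
```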
-
Jesse Gross authored
NUM_PARALLEL is currently enforced by the Ollama server process - it will only issue requests to the runner if the maximum number of concurrent requests has not been exceeded. Although this should be sufficient, it is good for the runner to protect its own data structures. Currently, if too many requests get through to the runner, they will just get stuck and never return. This may help with reports of Ollama hanging, though it is unclear how it would actually occur. Bug #7573
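A sketch of the runner guarding itself, sized by the parallelism setting and failing fast instead of letting excess requests get stuck; the environment variable handling, default value, and use of `golang.org/x/sync/semaphore` are assumptions for illustration.

```go
package main

import (
	"fmt"
	"os"
	"strconv"

	"golang.org/x/sync/semaphore"
)

func main() {
	// Illustrative only: size the runner's own guard from the parallelism
	// setting (falling back to 4 here), mirroring the limit the server enforces.
	parallel := 4
	if v, err := strconv.Atoi(os.Getenv("OLLAMA_NUM_PARALLEL")); err == nil && v > 0 {
		parallel = v
	}
	seqSem := semaphore.NewWeighted(int64(parallel))

	// Reject the excess request outright instead of letting it get stuck and
	// never return, which is the failure mode described above.
	for i := 0; i < parallel+1; i++ {
		if !seqSem.TryAcquire(1) {
			fmt.Println("request", i, "rejected: all sequence slots busy")
			continue
		}
		fmt.Println("request", i, "accepted")
	}
}
```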
-
Michael Yang authored
-
- 12 Nov, 2024 2 commits
-
-
Jesse Gross authored
-
Daniel Hiltgen authored
This adds support for the Jetson JetPack variants into the Go runner
-