- 27 Feb, 2025 11 commits
-
-
Michael Yang authored
-
Jesse Gross authored
Otherwise on Linux I get: `go: download go1.24 for linux/amd64: toolchain not available`
-
Blake Mizerany authored
This commit introduces a new API implementation for handling interactions with the registry and the local model cache. The new API is located in server/internal/registry. The package name is "registry" and should be considered temporary; it is hidden and not bleeding outside of the server package. As the commits roll in, we'll start consuming more of the API and then let reverse osmosis take effect, at which point it will surface closer to the root level packages as much as needed.
-
Steven Hartland authored
Fix the examples link in the go package documentation for the API.
-
Eries Trisnadi authored
-
Michael Yang authored
-
Michael Yang authored
-
Daniel Hiltgen authored
* Windows ARM build: skip cmake, and note it's unused in the developer docs.
* Win: only check for ninja when we need it. On Windows ARM, the cim lookup fails, but we don't need ninja anyway.
-
Blake Mizerany authored
The linter is secondary to the tests, so it should run after the tests, exposing test failures faster.
-
Jeffrey Morgan authored
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
-
Jeffrey Morgan authored
-
- 26 Feb, 2025 2 commits
-
-
Gordon Kamer authored
-
Daniel Hiltgen authored
* Add CUDA Blackwell architecture for v12
* Win: split ROCm out to a separate zip file
* Reduce CC matrix: the 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space. The 5.0 should be forward compatible with 5.2 and 5.3.
-
- 25 Feb, 2025 10 commits
-
-
Jeffrey Morgan authored
-
Blake Mizerany authored
During work on our new registry client, I ran into frustrations with CI where a misspelling in a comment caused the linter to fail, which caused the tests to not run, which caused the build to not be cached, which caused the next run to be slow, which caused me to be sad. This commit addresses these issues and pulls in some helpful changes we've had in CI on ollama.com for some time now. They are:

* Always run tests, even if the other checks fail. Tests are the most important part of CI and should always run. Failures in tests can be correlated with failures in other checks, and can help surface the root cause of the failure sooner. This is especially important when the failure is platform specific and the tests are not platform independent.

* Check that `go generate` is clean. This prevents `go generate` abuse regressions. This codebase used to use it to generate platform-specific binary build artifacts. Let's make sure that does not happen again, that this powerful tool is used correctly, and that the generated code is checked in. Also, while adding the `go generate` check, it was revealed that the generated Metal code was putting dates in the comments, resulting in non-deterministic builds. This is a bad practice, and this commit fixes that. Git tells us the most important date: the commit date, along with other associated changes.

* Check that `go mod tidy` is clean. A new job checks that `go mod tidy` is clean, to prevent easily preventable merge conflicts, or go.mod changes being deferred to a future PR that is unrelated to the change that caused go.mod to change.

* More robust caching. We now cache the go build cache and the go mod download cache independently. This is because the download cache contains zips that can be unpacked in parallel faster than they can be fetched and extracted by tar. This speeds up the build significantly.

The linter is hostile enough. It does not need to also punish us with longer build times due to small failures like misspellings.
-
Daniel Hiltgen authored
* Bump CUDA and ROCm versions: update ROCm to linux:6.3, win:6.2 and CUDA v12 to 12.8. Yum has some silent failure modes, so largely switch to dnf.
* Fix windows build script
-
José Pekkarinen authored
centos-7 images have been deprecated upstream and replaced with almalinux-8 images instead, requiring some small extra work. Signed-off-by: José Pekkarinen <jose.pekkarinen@foxhound.fi>
-
Chuanhui Liu authored
-
Michael Yang authored
This was accidentally removed when moving fs/ggml from its previous location.
-
Pavol Rusnak authored
CUDA 12.x still supports Compute Capability 5.0, 5.2 and 5.3, so let's build for these architectures as well
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Blake Mizerany authored
This commit copies (without history) the bmizerany/ollama-go repository with the intention of integrating it into ollama as a replacement for the pushing and pulling of models, and the management of the cache they are pushed to and pulled from. New homes for these packages will be determined as they are integrated and we have a better understanding of proper package boundaries.
-
Parth Sareen authored
-
- 24 Feb, 2025 3 commits
-
-
Parth Sareen authored
* envconfig: allow setting context length through env var
-
Blake Mizerany authored
-
Jeffrey Morgan authored
-
- 22 Feb, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Blake Mizerany authored
The route assembly in Handler lacked clear organization, making it difficult to scan for routes and their relationships to each other. This commit aims to fix that by reordering the assembly of routes to group them by category and purpose. Also, be more specific about what "config" refers to (it is about CORS if you were wondering... I was.)
-
- 21 Feb, 2025 3 commits
-
-
Jesse Gross authored
There are two benefits to doing this:
- Provide a library function that models can use, reducing code for each model implementation
- Enable a single place to drop in optimized implementations of attention based on the backend or other factors

One is provided for GGML. On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
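As a rough sketch of what such a shared attention helper centralizes — scaled dot-product attention, softmax(q·kᵀ/√d)·v — here is a scalar version over plain slices. This is only the underlying math, not ollama's tensor API or its GGML-optimized path:

```go
package main

import (
	"fmt"
	"math"
)

// attention computes softmax(q·kᵀ/√d)·v for a single query against a
// set of keys and values. A scalar sketch of the operation a shared
// attention helper would centralize; names are illustrative.
func attention(q []float64, keys, values [][]float64) []float64 {
	d := float64(len(q))
	scores := make([]float64, len(keys))
	for i, k := range keys {
		var dot float64
		for j := range q {
			dot += q[j] * k[j]
		}
		scores[i] = dot / math.Sqrt(d) // scaled dot product
	}
	// Numerically stable softmax over the scores.
	maxS := scores[0]
	for _, s := range scores[1:] {
		if s > maxS {
			maxS = s
		}
	}
	var sum float64
	for i, s := range scores {
		scores[i] = math.Exp(s - maxS)
		sum += scores[i]
	}
	// Weighted sum of the value vectors.
	out := make([]float64, len(values[0]))
	for i, w := range scores {
		for j, v := range values[i] {
			out[j] += (w / sum) * v
		}
	}
	return out
}

func main() {
	q := []float64{1, 0}
	keys := [][]float64{{1, 0}, {0, 1}}
	values := [][]float64{{1, 0}, {0, 1}}
	fmt.Println(attention(q, keys, values))
}
```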
-
Michael Yang authored
-
Junyan Qin (Chin) authored
-
- 20 Feb, 2025 9 commits
-
-
Jesse Gross authored
Currently Rows is called as the last step in a model computation to get the values for the output tokens. However, if we move it earlier in the process then we can trim out computations that never get used. This is similar to how models are defined in llama.cpp. Changing the model definition in this way improves token generation performance by approximately 8%.
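The effect of moving Rows earlier can be illustrated with plain slices: projecting every position and then keeping the last row wastes work, while selecting the row first projects only what is needed. The helper names below are hypothetical, not the model API:

```go
package main

import "fmt"

// matVec applies a weight matrix to one hidden state, producing the
// logits for a single token position.
func matVec(w [][]float64, h []float64) []float64 {
	out := make([]float64, len(w))
	for i, row := range w {
		for j, x := range row {
			out[i] += x * h[j]
		}
	}
	return out
}

// lastLogitsSlow projects every position, then keeps only the last
// row — the pattern where Rows runs as the final step.
func lastLogitsSlow(w [][]float64, hidden [][]float64) []float64 {
	all := make([][]float64, len(hidden))
	for t, h := range hidden {
		all[t] = matVec(w, h) // wasted work for all but the last t
	}
	return all[len(all)-1]
}

// lastLogitsFast selects the row first, then projects only it — the
// same result with a fraction of the computation.
func lastLogitsFast(w [][]float64, hidden [][]float64) []float64 {
	return matVec(w, hidden[len(hidden)-1])
}

func main() {
	w := [][]float64{{1, 2}, {3, 4}}
	hidden := [][]float64{{1, 1}, {2, 0}}
	fmt.Println(lastLogitsSlow(w, hidden), lastLogitsFast(w, hidden))
}
```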
-
Jesse Gross authored
We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.
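The "allocate once, reset per Compute" pattern the commit describes can be sketched with stand-in types. The real code drives the GGML scheduler through cgo; everything here is illustrative:

```go
package main

import "fmt"

// sched stands in for the GGML backend scheduler. The type and its
// methods are hypothetical, showing only the reuse pattern: one
// scheduler per backend, reset before each forward pass.
type sched struct {
	allocations int
	resets      int
}

func newSched() *sched  { return &sched{allocations: 1} }
func (s *sched) reset() { s.resets++ }

type backend struct{ s *sched }

// newBackend creates the scheduler once, when the backend is built.
func newBackend() *backend { return &backend{s: newSched()} }

// Compute reuses the long-lived scheduler instead of creating and
// destroying one per forward pass.
func (b *backend) Compute() {
	b.s.reset()
	// ... schedule and run the graph ...
}

func main() {
	b := newBackend()
	for i := 0; i < 3; i++ {
		b.Compute()
	}
	fmt.Println(b.s.allocations, b.s.resets)
}
```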
-
Jesse Gross authored
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
-
Bruce MacDonald authored
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods. Tests cover various error scenarios including:
- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
-
Michael Yang authored
Clang outputs are faster. We were previously building with clang via a gcc wrapper in cgo, but this was missed during the build updates, so there was a drop in performance.
-
frob authored
-
danielekp authored
-
Lucas Hahn authored
-
Michael Yang authored
-