"tests/vscode:/vscode.git/clone" did not exist on "a72a057d62d0adb2743b20968c72ae9cb5e5d62b"
- 08 Mar, 2025 2 commits
-
-
Jesse Gross authored
Similar to the llama engine, quantizing the KV cache requires flash attention to be enabled through the Ollama server.
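To make the dependency concrete, here is a minimal Go sketch of the kind of guard this implies. The function name and fallback behavior are assumptions for illustration, with q8_0/q4_0 standing in for quantized cache types:

```go
package sketch

import "log/slog"

// kvCacheType is a hypothetical helper: only honor a quantized KV cache
// type when flash attention is enabled, otherwise fall back to f16.
func kvCacheType(requested string, flashAttention bool) string {
	quantized := requested == "q8_0" || requested == "q4_0"
	if quantized && !flashAttention {
		slog.Warn("quantized KV cache requires flash attention; falling back to f16",
			"requested", requested)
		return "f16"
	}
	return requested
}
```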
-
Jesse Gross authored
Backends can impose additional alignment requirements on buffer sizes. We should ensure that we meet these or allocations can fail.
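A minimal sketch of the usual fix, assuming a simple round-up helper rather than the backend's actual API:

```go
// roundUp is an illustrative helper: round a requested buffer size up to the
// backend's alignment so the allocation meets its requirements.
func roundUp(size, alignment uint64) uint64 {
	if alignment == 0 {
		return size
	}
	return (size + alignment - 1) / alignment * alignment
}
```

For example, a 1000-byte request against a 256-byte alignment would be padded to 1024 bytes.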
-
- 07 Mar, 2025 13 commits
-
-
Jesse Gross authored
-
Michael Yang authored
This ensures the tensor is created on the right buffer type for backends such as the CPU.
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
temporary until tensor loading can accurately account for vision models
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Some tensors should be created on specific backends to reduce the number of copies and improve performance.
-
Michael Yang authored
each cache layer creates and maintains its own context instead of using a large context for all layers
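A rough sketch of the resulting shape, using stand-in types rather than the real cache structures:

```go
// Illustrative stand-ins only: each layer of the cache owns its own small
// context (created and freed independently) instead of sharing one large
// context across all layers.
type Context interface{ Close() }
type Tensor interface{}

type cacheLayer struct {
	ctx        Context // per-layer context
	keys, vals Tensor
}
```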
-
Michael Yang authored
some tensors are expected to be used in repeating layers but are not themselves repeated. this change copies these tensors into the same backends as their repeating counterparts to minimize copying tensors between backends
-
Michael Yang authored
Use a strategy similar to llama.cpp's for deciding where tensors should be allocated. This will be improved later to be aware of usable memory before assigning the tensor.
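As a rough illustration of that kind of strategy (made-up function and parameters, and ignoring memory accounting for now):

```go
// assignLayers is an illustrative sketch: offload the last gpuLayers layers,
// spreading them evenly across the available GPUs, and leave the rest on the
// CPU (index -1). A real strategy would also consider usable memory.
func assignLayers(numLayers, gpuLayers, numGPUs int) []int {
	assignment := make([]int, numLayers)
	for i := range assignment {
		assignment[i] = -1 // CPU by default
	}
	if gpuLayers <= 0 || numGPUs <= 0 {
		return assignment
	}
	start := numLayers - gpuLayers
	if start < 0 {
		start = 0
	}
	for i := start; i < numLayers; i++ {
		assignment[i] = (i - start) * numGPUs / gpuLayers
	}
	return assignment
}
```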
-
Jeffrey Morgan authored
-
- 04 Mar, 2025 1 commit
-
-
Michael Yang authored
- Output backend system info when initializing the backend; this ensures the information is always present without needing to be requested explicitly
- Convert to structured logging
- Enumerate devices rather than backends, since devices are ordered
- Track device indices grouped by device name
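A small sketch of what the structured, per-device logging could look like, using log/slog and a hypothetical device record:

```go
package sketch

import "log/slog"

// device is a stand-in for whatever the backend reports per device.
type device struct {
	name        string
	description string
	totalMemory uint64
}

// logSystemInfo emits one structured record per device, in enumeration order.
func logSystemInfo(devices []device) {
	for i, d := range devices {
		slog.Info("system",
			"device", i,
			"name", d.name,
			"description", d.description,
			"total_memory_bytes", d.totalMemory)
	}
}
```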
-
- 03 Mar, 2025 1 commit
-
-
Michael Yang authored
Expand backend loading error handling to catch more problems and log them instead of panicking.
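A sketch of the intended handling, with placeholder types, returning and logging the error rather than panicking:

```go
package sketch

import (
	"fmt"
	"log/slog"
)

// Backend and newBackend are placeholders; the point is the shape of the
// handling: report and return the error instead of panicking.
type Backend struct{}

func newBackend(path string) (*Backend, error) { return &Backend{}, nil }

func loadBackend(path string) (*Backend, error) {
	b, err := newBackend(path)
	if err != nil {
		slog.Error("unable to load backend", "path", path, "error", err)
		return nil, fmt.Errorf("loading backend %s: %w", path, err)
	}
	return b, nil
}
```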
-
- 02 Mar, 2025 4 commits
-
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
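A small sketch of how such requirements might be surfaced to the KV cache; the type and field names are assumptions, and the padding multiple is whatever the kernel reports:

```go
// CacheConfig is an illustrative stand-in for the requirements a backend
// could advertise to the KV cache when flash attention is enabled.
type CacheConfig struct {
	CachePadding int  // pad the visible cache length to this multiple
	PermutedV    bool // store values pre-permuted for the kernel
}

// paddedLength rounds the number of cache entries up to the kernel's padding.
func paddedLength(n int, cfg CacheConfig) int {
	if cfg.CachePadding <= 1 {
		return n
	}
	return (n + cfg.CachePadding - 1) / cfg.CachePadding * cfg.CachePadding
}
```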
-
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
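An interface-level sketch of the distinction, with stand-in names rather than the real ml API:

```go
// Illustrative interfaces only: Empty skips the initial memset for tensors
// the caller is about to overwrite entirely, while Zeros keeps the old
// allocate-and-clear behavior.
type Tensor interface{}

type Context interface {
	Zeros(shape ...int) Tensor // allocate and zero-fill
	Empty(shape ...int) Tensor // allocate only; caller must fully overwrite
}
```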
-
Jesse Gross authored
It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.
-
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%.

The permutations of query and key do not violate the continuity rules for mulmat, so those Contiguous calls can simply be removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead.

To support this and avoid unexpected tensor shapes being seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure - for example, flash attention has special padding requirements in the cache and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
-
- 27 Feb, 2025 5 commits
-
-
Michael Yang authored
- Update Context.Forward to accept multiple tensors to match the Context.Compute signature
- Update Context.Forward to return Context such that it can be chained with Context.Compute
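A sketch of the resulting shape, using stand-in interfaces rather than the actual ml types:

```go
// Illustrative stand-ins showing the chained form this change enables.
type Tensor interface{}

type Context interface {
	Forward(...Tensor) Context // now variadic and returns the Context
	Compute(...Tensor)
}

func forwardAndCompute(ctx Context, outputs ...Tensor) {
	ctx.Forward(outputs...).Compute(outputs...)
}
```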
-
Michael Yang authored
-
Michael Yang authored
-
Jeffrey Morgan authored
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
-
Jeffrey Morgan authored
-
- 25 Feb, 2025 1 commit
-
-
Blake Mizerany authored
During work on our new registry client, I ran into frustrations with CI where a misspelling in a comment caused the linter to fail, which caused the tests to not run, which caused the build to not be cached, which caused the next run to be slow, which caused me to be sad. This commit addresses these issues and pulls in some helpful changes we've had in CI on ollama.com for some time now. They are:

* Always run tests, even if the other checks fail. Tests are the most important part of CI, and should always run. Failures in tests can be correlated with failures in other checks, and can help surface the root cause of the failure sooner. This is especially important when the failure is platform specific and the tests are not platform independent.

* Check that `go generate` is clean. This prevents 'go generate' abuse regressions. This codebase used to use it to generate platform-specific binary build artifacts. Let's make sure that does not happen again, that this powerful tool is used correctly, and that the generated code is checked in. Also, while adding the `go generate` check, it was revealed that the generated Metal code was putting dates in the comments, resulting in non-deterministic builds. This is a bad practice, and this commit fixes that. Git tells us the most important date: the commit date, along with other associated changes.

* Check that `go mod tidy` is clean. A new job checks that `go mod tidy` is clean, to prevent easily avoidable merge conflicts or go.mod changes being deferred to a future PR that is unrelated to the change that caused go.mod to change.

* More robust caching. We now cache the go build cache and the go mod download cache independently. This is because the download cache contains zips that can be unpacked in parallel faster than they can be fetched and extracted by tar. This speeds up the build significantly.

The linter is hostile enough. It does not need to also punish us with longer build times due to small failures like misspellings.
-
- 24 Feb, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 21 Feb, 2025 2 commits
-
-
Jesse Gross authored
There are two benefits to doing this:
- Provides a library function that models can use, reducing code for each model implementation
- Enables a single place to drop in optimized implementations of attention based on the backend or other factors

One is provided for GGML. On CUDA this improves token generation rate by about 3%. It does not have a significant effect on Metal.

Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
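As an illustration of the generic path such a shared attention helper can fall back to when no fused kernel is available, here is a schematic sketch; the tensor interface is a stand-in, not the real ml API:

```go
// Stand-in tensor interface for illustration only.
type Tensor interface {
	MulmatT(other Tensor) Tensor // multiply by the transpose of other
	Scale(s float64) Tensor
	Softmax() Tensor
	Mulmat(other Tensor) Tensor
}

// attention computes softmax(Q·K^T · scale)·V, the generic scaled
// dot-product path a shared helper can use when no fused kernel exists.
func attention(query, key, value Tensor, scale float64) Tensor {
	scores := query.MulmatT(key).Scale(scale).Softmax()
	return scores.Mulmat(value)
}
```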
-
Michael Yang authored
-
- 20 Feb, 2025 2 commits
-
-
Jesse Gross authored
We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.
-
Jesse Gross authored
Currently the following parameters are in the runner but not used:
- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
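A sketch of the passthrough; only the field names are taken from the list above, the struct itself is an assumption:

```go
// BackendParams is an illustrative container for the runner parameters that
// now get handed to the backend.
type BackendParams struct {
	NumThreads   int       // threads
	NumGPULayers int       // numGPULayers
	MainGPU      int       // mainGPU
	TensorSplit  []float32 // tensorSplit
}
```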
-
- 19 Feb, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 18 Feb, 2025 1 commit
-
-
Michael Yang authored
Sapphire Rapids has AMX support, but it ends up having a negative performance impact. Emerald Rapids also has AMX support, with a positive performance impact; however, there's no reasonable way in GGML to differentiate between the two. The impact is small (~6%), so disable AMX entirely for simplicity.
-
- 14 Feb, 2025 6 commits
-
-
Daniel Hiltgen authored
-
Jeffrey Morgan authored
-
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server.

In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models, such as:
- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:

Start the server with the OLLAMA_NEW_ENGINE environment variable set:
OLLAMA_NEW_ENGINE=1 ./ollama serve

Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M:
./ollama run jessegross/llama3.1
-
Jesse Gross authored
-
Jesse Gross authored
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
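A sketch of that pattern with stand-in names: synchronize first, then copy in a single call so the destination buffer stays valid for its duration:

```go
// Illustrative stand-in, not the real tensor API.
type Tensor interface {
	Sync()                // block until pending async computation has completed
	CopyTo(dst []float32) // synchronous device-to-host copy
}

func readFloats(t Tensor, n int) []float32 {
	t.Sync()
	out := make([]float32, n)
	t.CopyTo(out) // one call; the Go buffer stays valid for its duration
	return out
}
```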
-
Jesse Gross authored
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
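A sketch of the before/after calling pattern, with hypothetical newTensor/CopyFrom names standing in for the real binding:

```go
// Illustrative stand-ins only.
type Tensor interface {
	CopyFrom(src []float32) // synchronous host-to-device copy
}

func newTensor(data []float32, n int) Tensor { panic("illustrative stub") }

func fromGoSlice(data []float32) Tensor {
	// Unsafe: newTensor(data, len(data)) would hand the library a Go pointer
	// that could be freed or moved while the context is still open.
	t := newTensor(nil, len(data)) // nil pointer + size: memory is allocated on the C side
	t.CopyFrom(data)               // then populate it with a synchronous copy
	return t
}
```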
-