- 03 Mar, 2025 3 commits
-
-
Jeffrey Morgan authored
-
Blake Mizerany authored
Previously, using a Registry required a DiskCache to be passed in to various methods. This was cumbersome, as the DiskCache is required for most operations and the DefaultCache is used in most of those cases. This change makes the DiskCache an optional field on the Registry struct. It also changes DefaultCache to initialize on first use, so that clients are not burdened with the cost of creating a new cache per use or with holding onto a cache for the lifetime of the Registry. Also, slip in some minor docs updates for Trace.
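The optional-field and initialize-on-first-use pattern described in this message can be sketched roughly as follows. This is a minimal illustration, not the code from the commit: the `DiskCache` fields, the `defaultCache` variable, the `cache` method, and the temp-directory location are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
	"sync"
)

// DiskCache is a hypothetical stand-in for the real cache type.
type DiskCache struct{ dir string }

// defaultCache builds the shared cache once, on the first call,
// and returns the same result to every later caller.
var defaultCache = sync.OnceValues(func() (*DiskCache, error) {
	dir := filepath.Join(os.TempDir(), "example-cache") // assumed location
	if err := os.MkdirAll(dir, 0o755); err != nil {
		return nil, err
	}
	return &DiskCache{dir: dir}, nil
})

// Registry treats Cache as optional: nil means "use the default".
type Registry struct {
	Cache *DiskCache
}

// cache returns the explicit cache if set, else the lazily built default.
func (r *Registry) cache() (*DiskCache, error) {
	if r.Cache != nil {
		return r.Cache, nil
	}
	return defaultCache()
}

func main() {
	var r Registry // zero value is usable; no cache passed in
	c1, err := r.cache()
	if err != nil {
		panic(err)
	}
	c2, _ := r.cache()
	fmt.Println(c1 == c2) // both calls share one default cache
}
```

With this shape, callers that do not care about the cache pay for its creation at most once, while callers that do care can still set the field explicitly.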
-
Jeffrey Morgan authored
Reverts ccache installation to be done manually via curl instead of the dnf package manager, because the dnf install has the side effect of prepending ccache's install directory to the front of the PATH.
-
- 02 Mar, 2025 7 commits
-
-
Blake Mizerany authored
The extended name format is a superset of the name format that only the client needs to know about, not the server or other dependents of the name package, so move the split logic into the client package. Also, take advantage of knowing about the extended name format to allow the client to use the extended name format when unlinking to verify they are unlinking the manifest with the content they intend.
-
Soulter authored
-
Jesse Gross authored
The GGML flash attention kernel has specific requirements for padding and permutation. This adds support to the KV cache for conforming to these requirements so that flash attention can be enabled. Flash attention can be used in the same situations as the llama engine and is enabled by the user in the same way.
-
Jesse Gross authored
In cases where we allocate a tensor and then fully overwrite it with copied data, it is wasteful to first zero out the memory.
-
Jesse Gross authored
It can be important for a tensor to know what backend it came from - for example, to know if flash attention is enabled.
-
Jesse Gross authored
Prior to performing attention, we need to permute query, key and value. Currently we call Contiguous after each of these permutations, which is correct but expensive. Avoiding the 3 calls to Contiguous increases performance by over 20%. The permutations of query and key do not violate the contiguity rules for mulmat, so the Contiguous call can simply be removed. Value requires a different permutation and does require Contiguous. However, we can use the copy into the cache as a way to perform this without further overhead. To support this and avoid unexpected tensor shapes being seen by models, we need tighter integration between attention, cache and backend. Future optimization will also likely need this structure: for example, flash attention has special padding requirements in the cache, and other backends may have their own needs. This further contains the operations that go into attention so that these and other optimizations can be handled transparently. Models that have special requirements for attention can still implement their own version of it.
-
Jeffrey Morgan authored
-
- 01 Mar, 2025 3 commits
-
-
Jeffrey Morgan authored
-
Blake Mizerany authored
This commit is a step towards the goal of making names less ceremonial outside of the registry client. Clients of the registry package can treat names as opaque strings, and the registry package will handle parsing, validating, and normalizing names. Ideally we end up with the names package tucked away in an internal package for good. We'll see how things go. Also, this package name is not permanent. This is another step in the ongoing process of refactoring the server code, and at some point it will most likely be renamed/moved.
-
Bruce MacDonald authored
More validation during the safetensor creation process. Properly handle relative paths (like ./model.safetensors) while rejecting absolute paths. Add comprehensive test coverage for various paths. No functionality changes for valid inputs; existing workflows remain unaffected. Leverages Go 1.24's new os.Root functionality for secure containment.
-
- 28 Feb, 2025 9 commits
-
-
Michael Yang authored
defer the cancel to guarantee it runs
-
Michael Yang authored
-
Jeffrey Morgan authored
Focuses initial Blackwell support on compute capability 12.0, which includes the 50x series of GeForce cards. In the future, additional compute capabilities may be added.
-
Blake Mizerany authored
Also, require the -as flag to be set when importing a model. This prevents the confusing error message "invalid name". Also, allow short names to be used when importing a model and auto-complete the name with the default mask.
-
Michael Yang authored
ggml added an option to disable flash attention so explicitly enable it
-
王贺 authored
-
Jeffrey Morgan authored
-
Blake Mizerany authored
Also, our commit messages have been getting better, but we can do better, and be more consistent. This adds more clarity on how to write commit messages and provides examples of good and bad messages. Also, our contributing guide was lacking helpful guidance on how to start change proposals. This commit adds the start of that section. Soon, we should add a proposal template to the issue tracker with a link back to the proposal section, which should also be expanded upon.
-
Bruce MacDonald authored
As we are adding support for weighted sampling, we have seen some performance regressions. Bypass the sampler logic for now and default to greedy until we can benchmark the new sampler logic.
-
- 27 Feb, 2025 15 commits
-
-
Parth Sareen authored
-
Michael Yang authored
-
Michael Yang authored
Update Context.Forward to accept multiple tensors to match the Context.Compute signature. Update Context.Forward to return Context so that it can be chained with Context.Compute.
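The chaining this enables might look roughly like the sketch below. The `Tensor` and `Context` types here are simplified, hypothetical stand-ins rather than the real types, and `Compute` only records names instead of evaluating anything.

```go
package main

import "fmt"

// Tensor and Context are hypothetical stand-ins for illustration only.
type Tensor struct{ Name string }

type Context struct {
	graph    []Tensor // tensors queued for the forward pass
	computed []string // names of tensors whose values were requested
}

// Forward queues any number of tensors and returns the receiver so that
// further calls can be chained onto it.
func (c *Context) Forward(tensors ...Tensor) *Context {
	c.graph = append(c.graph, tensors...)
	return c
}

// Compute takes the same variadic shape as Forward; here it only
// records the names of the tensors it was given.
func (c *Context) Compute(tensors ...Tensor) {
	for _, t := range tensors {
		c.computed = append(c.computed, t.Name)
	}
}

func main() {
	q, k := Tensor{Name: "q"}, Tensor{Name: "k"}
	ctx := &Context{}
	ctx.Forward(q, k).Compute(q) // one chained expression
	fmt.Println(len(ctx.graph), ctx.computed) // prints: 2 [q]
}
```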
-
Blake Mizerany authored
This fixes panics introduced in 2412adf4 when Gin ungracefully assumes that the http.ResponseWriter implements http.CloseNotifier and http.Flusher, which our new statusCodeRecorder does not. This is a temporary fix until we can pour the rest of the Gin out.
-
Michael Yang authored
-
Jesse Gross authored
Otherwise on Linux I get: go: download go1.24 for linux/amd64: toolchain not available
-
Blake Mizerany authored
This commit introduces a new API implementation for handling interactions with the registry and the local model cache. The new API is located in server/internal/registry. The package name is "registry" and should be considered temporary; it is hidden and not bleeding outside of the server package. As the commits roll in, we'll start consuming more of the API and then let reverse osmosis take effect, at which point it will surface closer to the root level packages as much as needed.
-
Steven Hartland authored
Fix the examples link in the go package documentation for the API.
-
Eries Trisnadi authored
-
Michael Yang authored
-
Michael Yang authored
-
Daniel Hiltgen authored
* Windows ARM build: skip cmake, and note it's unused in the developer docs.
* Win: only check for ninja when we need it. On Windows ARM, the CIM lookup fails, but we don't need ninja anyway.
-
Blake Mizerany authored
The linter is secondary to the tests, so it should run after the tests, exposing test failures faster.
-
Jeffrey Morgan authored
Fixes sync filters and lowers CUDA version to 11.3 in test.yaml
-
Jeffrey Morgan authored
-
- 26 Feb, 2025 2 commits
-
-
Gordon Kamer authored
-
Daniel Hiltgen authored
* Add CUDA Blackwell architecture for v12
* Win: Split ROCm out to separate zip file
* Reduce CC matrix: the 6.2 and 7.2 architectures only appear on Jetsons, so they were wasting space. The 5.0 should be forward compatible with 5.2 and 5.3.
-
- 25 Feb, 2025 1 commit
-
-
Jeffrey Morgan authored
-