- 13 Dec, 2025 (2 commits)
  - Jeffrey Morgan authored
  - Jeffrey Morgan authored
- 12 Dec, 2025 (1 commit)
  - Jeffrey Morgan authored
- 09 Dec, 2025 (1 commit)
  - Jeffrey Morgan authored
- 08 Dec, 2025 (1 commit)
  - Michael Yang authored: change to a flatter directory structure, group the options with the function, and update models to call rope in one place (a sketch follows below)
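To illustrate the kind of grouping this refactor describes, here is a minimal Go sketch; the package, struct, and field names are hypothetical and are not ollama's actual RoPE API.

```go
// Package rope is a hypothetical example of keeping rotary-embedding
// options next to the function that uses them, so models configure RoPE
// in one place instead of passing loose parameters around.
package rope

import "math"

// Options bundles the parameters every caller of RoPE needs.
type Options struct {
	Dim   int     // number of rotary dimensions
	Base  float64 // frequency base, e.g. 10000
	Scale float64 // linear position scaling factor
}

// InverseFrequency returns the rotation frequency for rotary pair i.
func (o Options) InverseFrequency(i int) float64 {
	return 1.0 / math.Pow(o.Base, float64(2*i)/float64(o.Dim))
}

// Angle returns the rotation angle for position pos and rotary pair i,
// applying the configured position scale.
func (o Options) Angle(pos, i int) float64 {
	return float64(pos) * o.Scale * o.InverseFrequency(i)
}
```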
- 28 Oct, 2025 (2 commits)
  - Michael Yang authored
  - Michael Yang authored
- 19 Sep, 2025 (1 commit)
  - Patrick Devine authored:
    - gemma: fix rope scaling for qat models
    - gofumpt yourself
- 17 Sep, 2025 (1 commit)
  - Michael Yang authored:
    - fix(llama): rope scale
    - spm llama
    - skip moe models
    - cleanup
- 16 Sep, 2025 (1 commit)
  - Michael Yang authored:
    - use ggml_*_split activations when possible
    - forward qkv
- 15 Sep, 2025 (1 commit)
  - Michael Yang authored: this cleans up the model interface slightly without much impact on other areas
- 04 Sep, 2025 (1 commit)
  - Michael Yang authored: ollama: add embeddings
- 20 May, 2025 (1 commit)
  - Michael Yang authored
- 15 May, 2025 (1 commit)
  - Jesse Gross authored: For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use it in the text model; the embedding tensor is completely opaque to the runner. This doesn't work when the embedding must be used across multiple batches, which can happen when the embedding is larger than the batch size. In those cases (as with llama4), we would like to create views that are more appropriately sized, but then the original source tensor would be used in multiple graphs, which isn't allowed. To avoid that, models with this pattern compute the embedding tensor on first use and recreate the individual views, so there is no longer a single combined vision and text graph. This change codifies the pattern of separating vision and text graphs: the logic for computing tensors on demand moves to the runner, so models no longer have to handle it, and the runner gains visibility into the multimodal tensors, which is important for memory management. (A sketch follows below.)
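As a rough illustration of the pattern described in that commit message, here is a hedged Go sketch; the Tensor and Context interfaces and the multimodalEntry type are invented for the example and do not match the runner's real API.

```go
package runnerexample

// Illustrative-only interfaces; the real ml and runner packages differ.
type Tensor interface {
	// Rows returns a view over rows [offset, offset+n).
	Rows(offset, n int) Tensor
}

type Context interface {
	// Compute runs the graph that produces t and returns the realized tensor.
	Compute(t Tensor) Tensor
}

// multimodalEntry is a vision embedding cached by the runner.
type multimodalEntry struct {
	src      Tensor // output of the separate vision graph
	computed Tensor // realized data, filled in lazily on first use
}

// view computes the embedding the first time it is needed and then returns
// a batch-sized view, so the source tensor is never placed in a second graph.
func (m *multimodalEntry) view(ctx Context, offset, n int) Tensor {
	if m.computed == nil {
		m.computed = ctx.Compute(m.src)
	}
	return m.computed.Rows(offset, n)
}
```

The point of the sketch is that only the cached, realized tensor is viewed per batch, which is what gives the runner visibility into the multimodal data it has to manage.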
- 13 May, 2025 (1 commit)
  - Michael Yang authored
- 25 Apr, 2025 (1 commit)
  - Michael Yang authored
- 03 Apr, 2025 (2 commits)
  - Bruce MacDonald authored: Mistral is a popular research lab making open-source models. This updates the forward pass of llama-architecture models to support both llama and mistral models by accounting for the additional metadata present in mistral models and finding the correct dimensions for the output projection. (A sketch follows after this date's entries.)
  - Michael Yang authored
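A hedged sketch of the dimension handling mentioned in the first commit above; the function and parameter names are assumptions for illustration, not the actual metadata keys or ollama code.

```go
package llamaexample

// attentionOutputDim picks the input width of the attention output
// projection. Mistral-style metadata can carry an explicit head dimension
// whose product with the head count differs from the embedding size;
// checkpoints without that field fall back to the embedding size.
func attentionOutputDim(headDim, headCount, embedDim int) int {
	if headDim > 0 {
		return headDim * headCount
	}
	return embedDim
}
```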
- 02 Apr, 2025 (1 commit)
  - Jeffrey Morgan authored
- 20 Mar, 2025 (1 commit)
  - Jesse Gross authored: Options is no longer very descriptive of this struct.
- 14 Mar, 2025 (1 commit)
  - Jesse Gross authored: Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image. Fixes #9697. (A sketch follows below.)
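A hedged Go sketch of that batching rule; the input type and sameBatch field are illustrative stand-ins, not the runner's real structures.

```go
package batchexample

// input is a single token or image patch; sameBatch is how many following
// inputs must be processed in the same batch as this one.
type input struct {
	sameBatch int
}

// splitBatches groups inputs into batches of at most batchSize, never
// splitting a same-batch span: a span larger than batchSize gets its own
// oversized batch rather than being cut in the middle of an image.
func splitBatches(inputs []input, batchSize int) [][]input {
	var batches [][]input
	var current []input
	for i := 0; i < len(inputs); {
		span := 1 + inputs[i].sameBatch
		if i+span > len(inputs) {
			span = len(inputs) - i
		}
		if len(current) > 0 && len(current)+span > batchSize {
			batches = append(batches, current) // flush before the image starts
			current = nil
		}
		current = append(current, inputs[i:i+span]...)
		i += span
	}
	if len(current) > 0 {
		batches = append(batches, current)
	}
	return batches
}
```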
- 12 Mar, 2025 (1 commit)
  - Bruce MacDonald authored: Softcap isn't in the whitepaper or implementation for the language model, so we should remove it. There is no discernible difference in output with it removed. (A note on soft-capping follows below.)
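For context on what was removed, logit soft-capping bounds a score smoothly within (-c, c) using a scaled tanh; this definition is added here for illustration only, and the capping constant and where it is applied are model-specific.

```go
package softcapexample

import "math"

// softcap applies c * tanh(x / c), smoothly bounding x within (-c, c).
// Removing it, as the commit does, leaves the raw score unchanged.
func softcap(x, c float64) float64 {
	return c * math.Tanh(x/c)
}
```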
- 11 Mar, 2025 (13 commits)
  - Jesse Gross authored: Currently we are using positions, which are relative to a sequence and may not be unique.
  - Jesse Gross authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Patrick Devine authored
  - Michael Yang authored
  - Patrick Devine authored
  - Michael Yang authored
  - Jesse Gross authored
  - Michael Yang authored
  - Patrick Devine authored