- 22 May, 2025 2 commits
-
-
Jesse Gross authored
FromFloatSlice and FromIntSlice return an error if the shape doesn't match the passed data or if memory can't be allocated. Since these are inputs, the memory being allocated is system memory rather than VRAM. In many cases, the caller can't really handle the error and panics anyway. Empty and Zeros already panic directly if they can't allocate memory. This makes things consistent by panicking in the first two cases as well, removing a fair amount of error handling code. It is also consistent with how Go typically handles these situations.
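A minimal sketch of the new behavior, using hypothetical types and names rather than the actual ollama ml package:

```go
package main

import "fmt"

// Tensor is a stand-in for the real backend tensor type.
type Tensor struct {
	data  []float32
	shape []int
}

// FromFloatSlice now panics on a shape mismatch (or allocation failure),
// matching Empty and Zeros, instead of returning an error the caller
// usually couldn't handle anyway.
func FromFloatSlice(data []float32, shape ...int) *Tensor {
	n := 1
	for _, d := range shape {
		n *= d
	}
	if n != len(data) {
		panic(fmt.Errorf("FromFloatSlice: shape %v does not match %d elements", shape, len(data)))
	}
	return &Tensor{data: data, shape: shape}
}

func main() {
	t := FromFloatSlice([]float32{1, 2, 3, 4, 5, 6}, 2, 3)
	fmt.Println(t.shape) // [2 3]
	// FromFloatSlice([]float32{1, 2, 3}, 2, 3) would panic
}
```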
-
Michael Yang authored
* fix mllama convert
  - transform attn_gate and ffn_gate
  - swap attention heads for vision models
* fix mllama: the mlp gate was applied in the wrong place
-
- 21 May, 2025 3 commits
-
-
Michael Yang authored
-
Michael Yang authored
* feat: qwen3 dense
* feat: qwen3moe
* fix llama4 moe
-
Michael Yang authored
Setting SameBatch on the vision start token is problematic because that token will be shared with other inputs that also use images. This causes the input to be cached, so the runner will not see SameBatch, and SameBatch may also be incorrect since it could refer to a different image. Assigning SameBatch to the image's input tokens resolves this by ensuring it is attached to the inputs that actually correspond to the image. Not setting SameBatch correctly may cause panics during inference, since images are no longer guaranteed to be in the same batch.
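A rough sketch of the fix, assuming an Input struct with Token, Multimodal, and SameBatch fields; the real runner types differ:

```go
package model

// Input is a simplified stand-in for the runner's input type.
type Input struct {
	Token      int32
	Multimodal any
	SameBatch  int // number of following inputs that must share this batch
}

// imageInputs attaches SameBatch to the image's own placeholder tokens
// rather than the shared vision start token. The start token may be cached
// and reused for other images, so a SameBatch value set on it could be
// lost or point at the wrong image.
func imageInputs(visionStart, placeholder int32, embedding any, nTokens int) []Input {
	inputs := []Input{{Token: visionStart}} // shared token: no SameBatch here
	for i := 0; i < nTokens; i++ {
		in := Input{Token: placeholder}
		if i == 0 {
			in.Multimodal = embedding
			in.SameBatch = nTokens // keep all of this image's tokens together
		}
		inputs = append(inputs, in)
	}
	return inputs
}
```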
-
- 20 May, 2025 1 commit
-
-
Michael Yang authored
-
- 19 May, 2025 2 commits
-
-
Michael Yang authored
* fix llama model
* fix mistral3.1 model: do not set default vision layers
-
Jesse Gross authored
Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This separates them into two steps:
- Create the backend, including enumerating tensors and allocating memory
- Load the tensor data
This allows more flexibility in managing model loading.
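A minimal sketch of the split, with hypothetical names; the real ml.Backend API is richer:

```go
package ml

import "os"

// tensor holds an allocated buffer whose data has not yet been read.
type tensor struct {
	offset int64
	data   []byte
}

// Backend is created quickly: tensors are enumerated and memory is
// allocated, but no weights are read yet.
type Backend struct {
	f       *os.File
	tensors map[string]*tensor
}

func New(path string) (*Backend, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	b := &Backend{f: f, tensors: map[string]*tensor{}}
	// ... parse metadata, record offsets, allocate buffers ...
	return b, nil
}

// Load is the slow step, performed separately: read each tensor's data
// into the buffer allocated by New.
func (b *Backend) Load() error {
	for _, t := range b.tensors {
		if _, err := b.f.ReadAt(t.data, t.offset); err != nil {
			return err
		}
	}
	return nil
}
```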
-
- 16 May, 2025 1 commit
-
-
Michael Yang authored
* get eos_token_id from generation_config.json
* refactor
* include both ids and strings in trace
* comments
* remove special case for gemma3 special vocab (#10743)
-
- 15 May, 2025 2 commits
-
-
Jesse Gross authored
For some multimodal models (such as gemma3), we create a single graph that generates the image embedding and then use this in the text model. The embedding tensor is completely opaque to the runner.

However, this doesn't work if we need to use the embedding in multiple batches, which can arise if the embedding is larger than the batch size. In these cases (as with llama4), we would like to create views that are more appropriately sized. However, if we do this then the original source tensor is used in multiple graphs, which isn't allowed. To avoid that problem, models with this pattern compute the embedding tensor on first use and recreate the individual views. There is no longer a single vision and text graph.

This codifies the pattern of separating vision and text graphs. The logic of computing tensors on demand is moved to the runner, so models no longer have to worry about it. It also gives the runner visibility into the multimodal tensors, which is important for memory management.
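A rough sketch of the runner-side pattern, using hypothetical types; the actual interfaces differ in detail:

```go
package runner

// Tensor stands in for an opaque backend tensor.
type Tensor struct{}

// multimodal tracks an image embedding produced by a separate vision graph.
type multimodal struct {
	embedding *Tensor
	computed  bool
}

// forBatch is called when a text batch needs part of the image embedding.
// The vision graph runs once, on first use, and a view sized to the batch
// is created each time, so the source tensor is never shared between
// multiple text graphs.
func (m *multimodal) forBatch(computeVision func() *Tensor, view func(t *Tensor, offset, n int) *Tensor, offset, n int) *Tensor {
	if !m.computed {
		m.embedding = computeVision()
		m.computed = true
	}
	return view(m.embedding, offset, n)
}
```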
-
Michael Yang authored
* panic if trying to pad 4d
* fix pixel values padding
-
- 14 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
-
- 13 May, 2025 1 commit
-
-
Michael Yang authored
-
- 12 May, 2025 2 commits
-
-
Bruce MacDonald authored
-
Michael Yang authored
reduce prompt log to trace level
-
- 26 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 25 Apr, 2025 6 commits
-
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
-
Michael Yang authored
Co-authored-by: Patrick Devine <patrick@infrahq.com>
-
Michael Yang authored
-
Michael Yang authored
-
- 24 Apr, 2025 1 commit
-
-
Parth Sareen authored
-
- 18 Apr, 2025 1 commit
-
-
Michael Yang authored
-
- 08 Apr, 2025 1 commit
-
-
Jesse Gross authored
Currently, the KV cache and graph are lazily allocated as needed. The cache is fully allocated on first use of the corresponding layer, whereas the graph grows with the size of the context. This can be an issue if another application allocates more VRAM after we do our calculations: Ollama will crash in the middle of inference.

If we instead allocate the maximum needed memory at startup of the runner, we will either succeed or fail at that point rather than at some surprising time in the future. Currently, this only generates a worst-case batch for text, which means that vision models may get a partial allocation and continue to lazily allocate the rest.
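An illustrative sketch of reserving a worst-case batch at startup, with hypothetical names; the real runner's reservation logic differs:

```go
package runner

// reserveWorstCase runs a maximum-size text batch through the forward pass
// at startup so KV cache and graph memory are claimed up front. If another
// process has taken VRAM, this fails immediately instead of mid-inference.
func reserveWorstCase(batchSize, contextLen int, forward func(tokens, positions []int32) error) error {
	tokens := make([]int32, batchSize)
	positions := make([]int32, batchSize)
	for i := range positions {
		// Place the batch at the end of the context: the largest graph.
		positions[i] = int32(contextLen - batchSize + i)
	}
	return forward(tokens, positions)
}
```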
-
- 03 Apr, 2025 2 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open-source models. This updates the forward pass of llama-architecture models to support both llama and mistral models by accounting for additional metadata present in mistral models and finding the correct dimensions for the output projection.
-
Michael Yang authored
-
- 02 Apr, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 21 Mar, 2025 1 commit
-
-
Michael Yang authored
-
- 20 Mar, 2025 3 commits
-
-
Jesse Gross authored
Rather than directly giving the input data to models, we can pass a tensor instead. In the short term, this saves some duplicated code.

Longer term, we will want to overlap setting up the next batch with processing of the current one. In that case, we will only have the shape of the tensor; it will not be loaded with data at the time of graph generation. By passing only a tensor to models now, we set up this possibility and prevent them from relying on data that they won't have in the future.

Although the same could be done for Positions and Outputs, in some cases we either need the raw input data or don't use them at all. Therefore, for now we leave them as they are and allow models to convert them to tensors as needed.
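A before/after sketch with hypothetical types; the actual model interfaces differ:

```go
package model

// Tensor stands in for a backend tensor; at graph-construction time only
// its shape may be known, not its data.
type Tensor interface{ Shape() []int }

type Batch struct {
	// Before: models received the raw token values.
	//   Inputs []int32
	// After: models receive only a tensor, so they cannot rely on data that
	// may not be loaded yet when the next batch is prepared in parallel.
	Inputs Tensor

	// Positions and Outputs stay as raw values for now; models convert them
	// to tensors themselves when needed.
	Positions []int32
	Outputs   []int32
}
```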
-
Jesse Gross authored
Options is no longer very descriptive of this struct.
-
Jesse Gross authored
Looks like a merge conflict that broke the model.
-
- 19 Mar, 2025 1 commit
-
-
Jeffrey Morgan authored
-
- 14 Mar, 2025 2 commits
-
-
Jesse Gross authored
Currently there is a single context per sequence, shared by all multimodal inputs. Since we build a vision encoder graph per image, with a large number of inputs we can eventually hit the maximum number of graph nodes per context. This changes to using a separate context for each image, ensuring that the available resource limits are consistent.
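A sketch of the per-image context idea, with hypothetical names; the real runner's context handling differs:

```go
package runner

// Context and Tensor are stand-ins for the backend's graph context and
// tensor types.
type Context interface{}
type Tensor interface{}

// encodeImages builds each vision encoder graph in its own context, so the
// per-context limit on graph nodes applies per image rather than to every
// image in the sequence combined.
func encodeImages(newContext func() Context, encode func(Context, []byte) Tensor, images [][]byte) []Tensor {
	embeddings := make([]Tensor, 0, len(images))
	for _, img := range images {
		ctx := newContext() // fresh context per image
		embeddings = append(embeddings, encode(ctx, img))
	}
	return embeddings
}
```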
-
Jesse Gross authored
Models may require that a set of inputs all be processed as part of the same batch. For example, if an image has multiple patches with fully connected attention between them, we should not split the batch in the middle of an image.

Fixes #9697
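A sketch of the constraint, using hypothetical types; the actual runner's batching code differs:

```go
package runner

// Input is a simplified stand-in for the runner's input type.
type Input struct {
	Token     int32
	SameBatch int // number of following inputs that must share this batch
}

// groupFits reports whether the input at index i, together with any inputs
// it pins via SameBatch (for example, all patches of one image), fits in
// the remaining room of the current batch. If not, the batch is flushed
// rather than split in the middle of the image.
func groupFits(inputs []Input, i, room int) bool {
	need := 1 + inputs[i].SameBatch
	return need <= room
}
```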
-
- 13 Mar, 2025 2 commits
-
-
Michael Yang authored
Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
-
Michael Yang authored
-
- 12 Mar, 2025 1 commit
-
-
Bruce MacDonald authored
Softcap isn't in the whitepaper/implementation for the language model, so we should remove it. There is no discernible difference in output with it removed.
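For reference, logit softcapping (the operation removed here, as commonly defined) squashes attention scores as cap * tanh(score / cap); a minimal sketch:

```go
package attention

import "math"

// softcap squashes attention scores in place as capVal * tanh(score / capVal);
// this is the operation that was removed from the language model.
func softcap(scores []float32, capVal float32) {
	for i, s := range scores {
		scores[i] = capVal * float32(math.Tanh(float64(s/capVal)))
	}
}
```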
-
- 11 Mar, 2025 1 commit
-
-
jmorganca authored
-