- 04 Aug, 2025 2 commits
  - Michael Yang authored
  - Michael Yang authored
- 26 Jun, 2025 3 commits
  - Jeffrey Morgan authored
  - Jeffrey Morgan authored
  - Michael Yang authored
    * update patches
    * cherry-pick metal mean kernel
    * cherry-pick cuda mean kernel
    * gemma3n
- 20 Jun, 2025 1 commit
  - Michael Yang authored
    * Reapply "feat: incremental gguf parser (#10822)" (#11114). This reverts commit a6e64fbd.
    * fix older ggufs
- 18 Jun, 2025 1 commit
  - Jeffrey Morgan authored
    This reverts commit 6b04cad7.
- 16 Jun, 2025 1 commit
  - Michael Yang authored
    * ggml: test write gguf order
    * ggml: fix write tensor order
- 12 Jun, 2025 1 commit
  - Michael Yang authored
    * incremental gguf parser
    * gguf: update test to not rely on gguf on disc
    * re-use existing create gguf
    * read capabilities from gguf kv
    * kv exists
    * update tests
    * s/doneFunc/successFunc/g
    * new buffered reader

    Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
- 19 May, 2025 1 commit
  - Jesse Gross authored
    Currently, when the backend is created, the tensors are loaded at the same time, which is a slow operation. This change separates them into two steps:
    - Create the backend, including enumerating tensors and allocating memory
    - Load the tensor data
    This allows more flexibility in managing model loading.
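The two-step loading described in the commit above can be sketched as follows. This is a minimal illustration in Python; the class, field, and function names are hypothetical, not Ollama's actual Go API:

```python
# Hypothetical sketch: split "create backend" from "load tensor data".
class Backend:
    def __init__(self, tensor_meta):
        # Step 1: enumerate tensors and allocate memory; no data is read yet.
        self.buffers = {name: bytearray(size)
                        for name, size in tensor_meta.items()}
        self.loaded = False

    def load(self, read_tensor):
        # Step 2: fill the preallocated buffers with tensor data (the slow part).
        for name, buf in self.buffers.items():
            buf[:] = read_tensor(name, len(buf))
        self.loaded = True

backend = Backend({"blk.0.attn_q.weight": 8})  # fast: metadata and allocation only
backend.load(lambda name, n: bytes(n))         # slow: actual tensor I/O
print(backend.loaded)  # True
```

Splitting the steps lets a caller inspect the enumerated tensors and placement decisions before committing to the expensive data read.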
- 14 May, 2025 3 commits
  - Bruce MacDonald authored
  - Bruce MacDonald authored
  - Michael Yang authored
- 12 May, 2025 1 commit
  - Daniel Hiltgen authored
    The quantization PR didn't block all unsupported file types, which this PR fixes. It also updates the API docs to reflect the now-reduced set of supported types.
- 07 May, 2025 1 commit
  - Daniel Hiltgen authored
    The err in the goroutine should not be shared with the outer scope.
- 06 May, 2025 1 commit
  - Daniel Hiltgen authored
    * Move quantization logic to GGML via new backend. This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
    * Remove "add model quantizations". This is no longer needed now that quantization is implemented in Go+GGML code directly.
- 05 May, 2025 1 commit
  - Jesse Gross authored
    Most of the time this is not an error.
- 01 May, 2025 1 commit
  - Michael Yang authored
    * add gguf_test
    * fix padding: padding was being added to the offset but not to the running count
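The padding bug in the commit above can be illustrated with a small sketch (hypothetical names; the alignment value is chosen for illustration): when a tensor offset is rounded up to an alignment boundary, the inserted padding must also be added to the running byte count, or every subsequent offset drifts.

```python
ALIGNMENT = 32  # illustrative alignment, not the actual GGUF value in use

def align(offset, alignment=ALIGNMENT):
    # Round offset up to the next multiple of alignment.
    return (offset + alignment - 1) // alignment * alignment

def layout(tensor_sizes):
    offsets, total = [], 0
    for size in tensor_sizes:
        padded = align(total)
        offsets.append(padded)
        total = padded + size  # the fix: include padding in the running count
    return offsets

print(layout([10, 20]))  # [0, 32]
```

If `total` were advanced by `size` alone (omitting the padding), the second tensor would be recorded at offset 32 but the running count would say 30, and a third tensor's offset would be computed from the wrong base.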
- 27 Apr, 2025 1 commit
  - Devon Rifkin authored
    If it's an array, it uses the max value in the array. If array values for head counts become more popular, we can consider a more invasive change like #10225 to calculate more accurate estimates. Fixes: #9984
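A minimal sketch of the logic described in the commit above (the function name is hypothetical): when the head-count metadata is a per-layer array rather than a single scalar, take the max for a conservative memory estimate.

```python
def head_count_for_estimate(value):
    # Hypothetical helper: head-count metadata may be a scalar or a
    # per-layer array; for estimation, assume the largest layer.
    if isinstance(value, (list, tuple)):
        return max(value)
    return value

print(head_count_for_estimate([8, 8, 16, 8]))  # 16
print(head_count_for_estimate(32))             # 32
```

Using the max can over-estimate for mixed-size models, which is why the commit notes that a per-layer calculation (as in #10225) would be the more accurate, but more invasive, follow-up.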
- 25 Apr, 2025 9 commits
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
    Use a default of 1024; asking for zero is confusing since most calls seem to assume 0 means do not read any data.
  - Michael Yang authored
  - Michael Yang authored
- 16 Apr, 2025 1 commit
  - Michael Yang authored
- 03 Apr, 2025 2 commits
  - Bruce MacDonald authored
    Mistral is a popular research lab making open-source models. This updates the forward pass of llama-architecture models to support both Llama and Mistral models by accounting for additional metadata present in Mistral models and finding the correct dimensions for the output projection.
  - Michael Yang authored
- 26 Mar, 2025 1 commit
  - Jesse Gross authored
    Gemma3 uses sliding windows for its context on 5/6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size.

    Llama3.2-vision is also inconsistent between self-attention and cross-attention layers - at the moment, we calculate the correct total size and then average it across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized by the average.

    This allows memory estimation to calculate per-layer KV cache size and take it into account when placing layers onto GPUs. We already do this for weights that vary per-tensor, so this is a logical extension.

    Fixes #9730
    Fixes #9890
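The per-layer estimation described in the commit above can be sketched as follows. All names, layer patterns, and sizes here are hypothetical, not Ollama's actual code: sliding-window layers need far less KV cache than full-attention layers, so placement should sum per-layer sizes rather than assume every layer is the max.

```python
def kv_bytes(ctx_len, head_dim, n_kv_heads, bytes_per_elem=2):
    # Keys plus values for one layer (fp16 cache assumed).
    return 2 * ctx_len * head_dim * n_kv_heads * bytes_per_elem

def per_layer_kv(n_layers, ctx_len, window, full_every=6,
                 head_dim=128, n_kv_heads=8):
    # Hypothetical pattern: every sixth layer uses full attention,
    # the rest attend only over a sliding window.
    sizes = []
    for layer in range(n_layers):
        if (layer + 1) % full_every == 0:
            effective_ctx = ctx_len
        else:
            effective_ctx = min(window, ctx_len)
        sizes.append(kv_bytes(effective_ctx, head_dim, n_kv_heads))
    return sizes

sizes = per_layer_kv(n_layers=6, ctx_len=8192, window=1024)
# Summing actual per-layer sizes is far smaller than the conservative
# estimate of n_layers * max(sizes).
print(sum(sizes) < 6 * max(sizes))  # True
```

With these illustrative numbers, five sliding-window layers plus one full-attention layer sum to a fraction of the "every layer is max size" estimate, which is exactly the gap the commit closes for GPU placement.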
- 13 Mar, 2025 6 commits
  - Michael Yang authored
  - Michael Yang authored
  - Michael Yang authored
    The largest operation is by far (q @ k), so just count that for simplicity.
  - Michael Yang authored
  - Michael Yang authored
  - Patrick Devine authored
    Add metadata and tensor information to the show command to be able to see more information about a model. This outputs the same data as shown on the model details page on ollama.com.
- 11 Mar, 2025 2 commits
  - Daniel Hiltgen authored
  - Patrick Devine authored