- 09 Dec, 2025 1 commit
nicole pardal authored

- 04 Dec, 2025 1 commit
Patrick Devine authored
This change adds the ability for `ollama create` to convert models that use the DeepSeek2 architecture (specifically DeepSeekV3 and DeepSeek-R1).
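As a hedged illustration (the path and model name below are hypothetical), conversion is driven through `ollama create` with a Modelfile whose `FROM` points at the local safetensors checkout:

```
# Modelfile (hypothetical local checkout of a DeepSeek2-architecture model)
FROM /path/to/DeepSeek-R1
```

followed by `ollama create deepseek-r1 -f Modelfile`, which converts the safetensors weights during import.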
- 19 Nov, 2025 1 commit
Michael Yang authored

- 29 Oct, 2025 1 commit
Michael Yang authored

- 05 Aug, 2025 1 commit
Michael Yang authored
* bf16
* tests
* gpt-oss
* enable gptoss for engine
* rough estimate
* convert to mxfp4
* handle safetensors U8
* clamp glu/linear
* update tokenizer
* MXFP4 support
  This implements the Open Compute Microscaling (MX) FP4 format as a tensor type, with backend implementations focusing on mulmat and mulmatid on CPU, CUDA, and Metal.
* Unit tests for MXFP4 support
  This exercises various operations and shapes on both CPU and GPU (if detected on the system).
* cuda graph
* unit test adjustments
* cuda: optimize memory access
  Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4.
* mac: fix crash on old macOS versions
  cblas_sgemm is only supported on v13.3 and up; however, bf16 is only supported on v14+, so we were falling back to ggml-blas and crashing on bf16 tensors. Checking for the function being null seems to be the simplest way to conditionally avoid registering the backend.
* server: minimum context length for gptoss
  This model requires a minimum context length of 8192 to function effectively. Users can set higher values through all normal mechanisms, but lower values will be silently reset.
* ggml: multiply by numParallel for gptoss sliding window
  When computing the graph size estimate, the context size is already multiplied by numParallel so estimates reflect that. However, since sliding-window models use a smaller, fixed context size, they need to take numParallel into account manually.
* gpt-oss integration
  Includes the harmony parser, thinking levels, etc.
* fix sync
* fix tests
* fix lint
---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>
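To make the MXFP4 layout concrete, here is a hedged Go sketch of decoding one block per the OCP Microscaling spec (32 FP4/E2M1 elements sharing one E8M0 scale). The element order within a byte and the API are illustrative assumptions, not ollama's actual kernels:

```go
package mxfp4

import "math"

// e2m1 lists the eight magnitudes representable by the FP4 (E2M1) element
// type: 1 sign bit, 2 exponent bits, 1 mantissa bit.
var e2m1 = [8]float32{0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0}

// decodeBlock expands one MXFP4 block: a shared 8-bit E8M0 scale
// (value 2^(e-127)) followed by 16 bytes packing two 4-bit elements each.
// Low-nibble-first ordering is an assumption for illustration.
func decodeBlock(scale byte, data [16]byte) [32]float32 {
	s := float32(math.Exp2(float64(int(scale) - 127)))
	var out [32]float32
	for i, b := range data {
		for j, nib := range [2]byte{b & 0x0f, b >> 4} {
			v := e2m1[nib&0x7]
			if nib&0x8 != 0 { // high bit of the nibble is the sign
				v = -v
			}
			out[2*i+j] = v * s
		}
	}
	return out
}
```

The commit's "read 4 bytes at a time (8 elements)" CUDA optimization follows from this packing: each byte holds two elements, so a 4-byte load covers eight.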
- 26 Jun, 2025 1 commit
Michael Yang authored
* update patches
* cherry pick metal mean kernel
* cherry pick cuda mean kernel
* gemma3n
- 16 May, 2025 1 commit
Michael Yang authored
* get eos_token_id from generation_config.json
* refactor
* include both ids and strings in trace
* comments
* remove special case for gemma3 special vocab (#10743)
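For reference, a hedged example of the field in question: a Hugging Face checkpoint's generation_config.json may declare one or several EOS token ids (the values below are illustrative, not from any particular model):

```json
{
  "bos_token_id": 2,
  "eos_token_id": [1, 106]
}
```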
- 14 May, 2025 2 commits
Bruce MacDonald authored
Michael Yang authored
- 06 May, 2025 1 commit
Daniel Hiltgen authored
* Move quantization logic to GGML via new backend
  This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations"
  This is no longer needed now that quantization is implemented in Go+GGML code directly.
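A hypothetical sketch of the split described above: Go owns the model-aware decision of which tensors to quantize, while GGML performs the actual conversion. The type and heuristics here are illustrative, not ollama's actual code:

```go
package quantize

import "strings"

// TensorInfo is an illustrative stand-in for the metadata Go inspects
// before handing the actual conversion to GGML's quantization kernels.
type TensorInfo struct {
	Name string
	Dims int
}

// wantQuantize is the "model aware" part: keep small or precision-sensitive
// tensors in their original type and quantize the large 2D weight matrices.
func wantQuantize(t TensorInfo) bool {
	if t.Dims < 2 {
		return false // 1D tensors (norms, biases) stay unquantized
	}
	if strings.Contains(t.Name, "norm") {
		return false // normalization weights are precision sensitive
	}
	return true
}
```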
- 25 Apr, 2025 2 commits
Michael Yang authored
Michael Yang authored
- 03 Apr, 2025 1 commit
Bruce MacDonald authored
Mistral is a popular research lab making open-source models. This updates the forward pass of llama-architecture models to support both Llama and Mistral models by accounting for additional metadata present in Mistral models and finding the correct dimensions for the output projection.
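A hedged sketch of the dimension handling this implies, in Go: some Mistral checkpoints carry an explicit head-dimension field, while Llama-style metadata implies it from hidden size and attention heads; the output projection can then be sized from the true head dimension rather than assumed equal to the hidden size. Field names are illustrative assumptions:

```go
package model

// headDim resolves the per-head dimension: explicit metadata wins
// (Mistral-style), otherwise fall back to the classic Llama derivation.
func headDim(metaHeadDim, hiddenSize, numHeads int) int {
	if metaHeadDim > 0 {
		return metaHeadDim
	}
	return hiddenSize / numHeads
}
```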
- 18 Mar, 2025 1 commit
Bruce MacDonald authored
When a model's architecture cannot be converted, return the name of the unsupported architecture in the error message.
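A minimal Go sketch of the improved error, assuming a conversion entry point that switches on the architecture string (the interface, type, and case names are illustrative):

```go
package convert

import "fmt"

// Converter is an illustrative stand-in for a per-architecture converter.
type Converter interface{ Convert() error }

type llamaConverter struct{}

func (llamaConverter) Convert() error { return nil }

// converterFor names the offending architecture in its error instead of
// returning a generic failure.
func converterFor(arch string) (Converter, error) {
	switch arch {
	case "LlamaForCausalLM":
		return llamaConverter{}, nil
	default:
		return nil, fmt.Errorf("unsupported architecture %q", arch)
	}
}
```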
- 11 Mar, 2025 4 commits
jmorganca authored
Patrick Devine authored
Michael Yang authored
Patrick Devine authored
- 14 Feb, 2025 1 commit
Michael Yang authored
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high-level interfaces and a number of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`.
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`.
- `ml.Tensor` defines the interface for a tensor and tensor operations.

This is the first implementation of the new engine. Follow-up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080), with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
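A heavily simplified, hypothetical Go sketch of how the three interfaces fit together; the real definitions in `model/model.go` and `ml/backend.go` have many more methods and different signatures:

```go
package engine

// Tensor stands in for ml.Tensor: a handle to device memory plus the tensor
// operations a forward pass composes.
type Tensor interface {
	Add(t2 Tensor) Tensor
	Mulmat(t2 Tensor) Tensor
}

// Backend stands in for ml.Backend: it loads a pretrained model into hardware
// (GPU, CPU, etc.) and lets Models access the loaded tensors by name.
type Backend interface {
	Get(name string) Tensor
}

// Model stands in for model.Model: an architecture such as llama or mllama
// implements its forward propagation here; the engine calls it to generate
// completions.
type Model interface {
	Forward(b Backend, tokens []int32) (logits Tensor, err error)
}
```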
- 16 Jan, 2025 1 commit
Josh authored
---------
Co-authored-by: Patrick Devine <patrick@infrahq.com>
- 14 Jan, 2025 1 commit
Bruce MacDonald authored
Add native support for converting Qwen2-family models (including Qwen2.5) from safetensors to GGUF format so we can run them.
- 10 Sep, 2024 1 commit
Patrick Devine authored

- 23 Aug, 2024 1 commit
Patrick Devine authored

- 21 Aug, 2024 2 commits
Michael Yang authored
Michael Yang authored

- 12 Aug, 2024 1 commit
Michael Yang authored

- 31 Jul, 2024 3 commits
Michael Yang authored
Michael Yang authored
Michael Yang authored

- 04 Jun, 2024 1 commit
Michael Yang authored

- 20 May, 2024 5 commits
Michael Yang authored
Michael Yang authored
Patrick Devine authored
Patrick Devine authored
Patrick Devine authored

- 06 May, 2024 1 commit
Michael Yang authored
- FROM /path/to/{safetensors,pytorch}
- FROM /path/to/fp{16,32}.bin
- FROM model:fp{16,32}
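For example, the last form lets a Modelfile derive a new model from an existing model's fp16 weights (the model name below is hypothetical):

```
# Modelfile: build on an existing model's fp16 weights
FROM llama3:fp16
PARAMETER temperature 0.7
```

built as usual with `ollama create mymodel -f Modelfile`.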
- 24 Apr, 2024 1 commit
Patrick Devine authored

- 15 Apr, 2024 1 commit
Patrick Devine authored

- 06 Apr, 2024 1 commit
Michael Yang authored

- 01 Apr, 2024 1 commit
Patrick Devine authored