Commits · fa7776fd2458fc3a8aeb7f12e4bc65b439955319 · OpenDAS / ollama

05 Aug, 2025 1 commit

Michael Yang authored Aug 05, 2025



* bf16

* tests

* gpt-oss

* enable gptoss for engine

* rough estimate

* convert to mxfp4

* handle safetensors U8

* clamp glu/linear

* update tokenizer

* MXFP4 support

This implements the Open Compute Microscaling (MX) FP4 format
as a tensor type with backend implementations focusing
on mulmat and mulmatid on CPU, CUDA, and Metal.

* Unit tests for MXFP4 support

This exercises various operations and shapes on both CPU and GPU (if one is detected
on the system).

* cuda graph

* unit test adjustments

* cuda: optimize memory access

Read 4 bytes at a time (8 elements) when performing mul_mat_vec_mxfp4

* mac: fix crash on old macos versions

cblas_sgemm is only supported on v13.3 and up; however, bf16 is
only supported on v14+, so we were falling back to ggml-blas and
crashing on bf16 tensors. Checking whether the function is null
seems to be the simplest way to conditionally avoid registering the
backend.

* server: Minimum context length for gptoss

This model requires a minimum context length of 8192 to function
effectively. Users can set higher values through all normal mechanisms,
but lower values will be silently raised to 8192.

* ggml: Multiply by numParallel for gptoss sliding window

When computing the graph size estimate, the context size is already
multiplied by numParallel so estimates reflect that. However, since
sliding window models use a smaller, fixed context size, they need
to manually take numParallel into account.

* gpt-oss integration

Includes the harmony parser, thinking levels, etc.

* fix sync

* fix tests

* fix lint

---------
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
Co-authored-by: Jesse Gross <jesse@ollama.com>
Co-authored-by: Devon Rifkin <drifkin@drifkin.net>

fa7776fd
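The MXFP4 format named in this commit comes from the Open Compute MX specification: each block holds 32 FP4 (E2M1) elements that share one 8-bit power-of-two scale (E8M0, value 2^(e-127)). The Go sketch below dequantizes one block and, like the CUDA optimization described in the commit, reads 4 bytes (8 packed elements) at a time. The function names and the nibble order (low nibble first) are assumptions for illustration; ggml's actual `block_mxfp4` layout and kernels may order elements differently.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"math"
)

// fp4E2M1 maps the 3 magnitude bits of an FP4 (E2M1) code to its value.
var fp4E2M1 = [8]float32{0, 0.5, 1, 1.5, 2, 3, 4, 6}

const blockSize = 32 // elements per MX block

// decodeFP4 converts a 4-bit E2M1 code (bit 3 is the sign) to float32.
func decodeFP4(code uint8) float32 {
	v := fp4E2M1[code&0x7]
	if code&0x8 != 0 {
		return -v
	}
	return v
}

// decodeE8M0 converts the shared scale byte to 2^(e-127); 0xFF encodes NaN.
func decodeE8M0(e uint8) float32 {
	if e == 0xFF {
		return float32(math.NaN())
	}
	return float32(math.Ldexp(1, int(e)-127))
}

// decodeBlock dequantizes one block: a scale byte plus 16 packed bytes.
// Assumed layout: byte k holds element 2k in its low nibble and 2k+1 in its high nibble.
func decodeBlock(scale uint8, packed [blockSize / 2]byte) [blockSize]float32 {
	var out [blockSize]float32
	s := decodeE8M0(scale)
	// Load 4 bytes (8 elements) per iteration instead of one byte at a time.
	for i := 0; i < len(packed); i += 4 {
		word := binary.LittleEndian.Uint32(packed[i : i+4])
		for j := 0; j < 8; j++ {
			code := uint8(word>>(4*j)) & 0xF
			out[2*i+j] = decodeFP4(code) * s
		}
	}
	return out
}

func main() {
	var packed [blockSize / 2]byte
	packed[0] = 0xA3                // low nibble 0x3 = +1.5, high nibble 0xA = -1.0
	out := decodeBlock(128, packed) // scale = 2^(128-127) = 2
	fmt.Println(out[0], out[1])
}
```

Running it prints `3 -2`: the codes 0x3 (+1.5) and 0xA (-1.0) are each multiplied by the shared scale of 2.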

23 Jul, 2025 1 commit
- s#x/exp/maps#maps# (#11506) · 6c733bf0
  Michael Yang authored Jul 23, 2025
  
  6c733bf0
07 Jul, 2025 1 commit
- template: add tool result compatibility (#11294) · 1f91cb0c
  Parth Sareen authored Jul 07, 2025
  
  1f91cb0c
29 May, 2025 1 commit

add thinking support to the api and cli (#10584) · 5f57b0ef

Devon Rifkin authored May 28, 2025

- Both `/api/generate` and `/api/chat` now accept a `"think"`
  option that allows specifying whether thinking mode should be on or
  not
- Templates get passed this new option so, e.g., qwen3's template can
  put `/think` or `/no_think` in the system prompt depending on the
  value of the setting
- Models' thinking support is inferred by inspecting model templates.
  The prefix and suffix the parser uses to identify thinking support are
  also automatically inferred from templates
- Thinking control & parsing is opt-in via the API to prevent breaking
  existing API consumers. If the `"think"` option is not specified, the
  behavior is unchanged from previous versions of ollama
- Add parsing for thinking blocks in both streaming/non-streaming mode
  in both `/generate` and `/chat`
- Update the CLI to make use of these changes. Users can pass `--think`
  or `--think=false` to control thinking, or during an interactive
  session they can use the commands `/se...

5f57b0ef

20 Mar, 2025 1 commit
- templates: add autotemplate for gemma3 (#9880) · f8c3dbe5
  Patrick Devine authored Mar 20, 2025
```
This change allows the gemma3 template to be autodetected during `ollama
create`.
```
  f8c3dbe5
14 Feb, 2025 1 commit

next ollama runner (#7913) · 58245413

Michael Yang authored Feb 14, 2025



feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces three high-level interfaces and several smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>

58245413
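To make the division of responsibilities concrete, here is a toy Go sketch of three interfaces in the spirit of `model.Model`, `ml.Backend`, and `ml.Tensor`. The method sets (`Forward`, `Get`, `Add`, `Scale`, `Floats`) and the `output.bias` weight name are hypothetical simplifications, not the real signatures in `model/model.go` or `ml/backend.go`; an in-memory map stands in for ggml loading weights onto hardware.

```go
package main

import "fmt"

// Tensor is a hypothetical, heavily simplified stand-in for ml.Tensor.
type Tensor interface {
	Shape() []int
	Add(other Tensor) Tensor
	Scale(s float32) Tensor
	Floats() []float32
}

// Backend is a hypothetical stand-in for ml.Backend: it owns loaded
// weights and hands them to models by name.
type Backend interface {
	Get(name string) Tensor
}

// Model is a hypothetical stand-in for model.Model: an architecture that
// implements forward propagation using tensors supplied by a Backend.
type Model interface {
	Forward(b Backend, input Tensor) Tensor
}

// sliceTensor is a toy 1-D CPU tensor.
type sliceTensor []float32

func (t sliceTensor) Shape() []int      { return []int{len(t)} }
func (t sliceTensor) Floats() []float32 { return t }

func (t sliceTensor) Add(o Tensor) Tensor {
	of := o.Floats()
	out := make(sliceTensor, len(t))
	for i := range t {
		out[i] = t[i] + of[i]
	}
	return out
}

func (t sliceTensor) Scale(s float32) Tensor {
	out := make(sliceTensor, len(t))
	for i := range t {
		out[i] = t[i] * s
	}
	return out
}

// mapBackend "loads" weights from memory instead of a model file.
type mapBackend map[string]Tensor

func (m mapBackend) Get(name string) Tensor { return m[name] }

// biasModel is a toy architecture: output = 2*input + bias.
type biasModel struct{}

func (biasModel) Forward(b Backend, input Tensor) Tensor {
	return input.Scale(2).Add(b.Get("output.bias"))
}

func main() {
	backend := mapBackend{"output.bias": sliceTensor{1, 1, 1}}
	var m Model = biasModel{}
	out := m.Forward(backend, sliceTensor{1, 2, 3})
	fmt.Println(out.Floats()) // [3 5 7]
}
```

The model only describes the computation; where the weights live and how tensor operations execute are the backend's and tensor's concern, so the same `Forward` could run on a different backend.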

16 Jan, 2025 1 commit
- convert: import support for command-r models from safetensors (#6063) · 93a8daf2
  Josh authored Jan 15, 2025
```
---------
Co-authored-by: Patrick Devine <patrick@infrahq.com>
```
  93a8daf2
18 Oct, 2024 1 commit

image processing for llama3.2 (#6963) · c7cb0f06

Patrick Devine authored Oct 18, 2024


Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>

c7cb0f06

28 Aug, 2024 1 commit
- add llama3.1 chat template (#6545) · 7416ced7
  Patrick Devine authored Aug 28, 2024
  
  7416ced7
27 Aug, 2024 1 commit
- update templates to use messages · 413ae39f
  Michael Yang authored Aug 27, 2024
  
  413ae39f
02 Aug, 2024 1 commit
- lint · b732beba
  Michael Yang authored Aug 01, 2024
  
  b732beba
20 Jul, 2024 1 commit
- preserve last assistant message (#5802) · 20090f31
  Jeffrey Morgan authored Jul 19, 2024
  
  20090f31
17 Jul, 2024 2 commits
- marshal json automatically for some template values (#5758) · b2554455
  Michael Yang authored Jul 17, 2024
  
  b2554455
- stub response (#5750) · 5b82960d
  Michael Yang authored Jul 17, 2024
  
  5b82960d
16 Jul, 2024 2 commits
- add suffix support to generate endpoint · d290e875
  Michael Yang authored Jun 20, 2024
```
This behavior is triggered by the presence of "suffix" and is particularly
useful for code completion tasks.
```
  d290e875
- tools test · ef5136a7
  Michael Yang authored Jul 15, 2024
  
  ef5136a7
15 Jul, 2024 1 commit
- tools · d02bbebb
  Michael Yang authored Jun 20, 2024
  
  d02bbebb
13 Jul, 2024 1 commit

fix system prompt (#5662) · 22c5451f

Michael Yang authored Jul 12, 2024



* fix system prompt

* execute template when hitting previous roles

* fix tests

---------
Co-authored-by: jmorganca <jmorganca@gmail.com>

22c5451f

12 Jul, 2024 3 commits
- autodetect stop parameters from template · ebc529cb
  Michael Yang authored Jul 05, 2024
  
  ebc529cb
- template: preprocess message and collect system · 36c87c43
  Michael Yang authored Jul 12, 2024
  
  36c87c43
- rename aggregate to contents · 5056bb9c
  Michael Yang authored Jul 11, 2024
  
  5056bb9c
11 Jul, 2024 4 commits
- revert embedded templates to use prompt/response · 57ec6901
  Michael Yang authored Jul 11, 2024
```
This reverts commit 19753c18.

This is for compatibility; messages will be added at a later date.
```
  57ec6901
- do not automatically aggregate system messages · e64f9ebb
  Michael Yang authored Jul 11, 2024
  
  e64f9ebb
- update embedded templates · 19753c18
  Michael Yang authored Jul 10, 2024
  
  19753c18
- add system prompt to first legacy template · 41be2809
  Michael Yang authored Jul 10, 2024
  
  41be2809
05 Jul, 2024 4 commits
- update named templates · fb6cbc02
  Michael Yang authored Jun 27, 2024
  
  fb6cbc02
- no funcs · 326363b3
  Michael Yang authored Jul 03, 2024
  
  326363b3
- comments · 2c3fe1fd
  Michael Yang authored Jun 20, 2024
  
  2c3fe1fd
- update message processing · 269ed6e6
  Michael Yang authored Jun 17, 2024
  
  269ed6e6
01 Jul, 2024 2 commits
- add capabilities · a30915bd
  Michael Yang authored Jun 11, 2024
  
  a30915bd
- rename templates to template · 58e3fff3
  Michael Yang authored Jun 10, 2024
  
  58e3fff3