Commits · 6bd0a983cd2cf74f27df2e5a5c80f1794a2ed7ef · OpenDAS / ollama

03 Apr, 2025 1 commit

model: support for mistral-small in the ollama runner · 6bd0a983

Bruce MacDonald authored Mar 14, 2025

Mistral is a popular research lab making open source models. This updates
the forward pass of llama architecture models to support both llama models
and mistral models by accounting for additional metadata present in mistral
models, and finding the correct dimensions for the output projection.
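
As a rough illustration of why the output projection dimensions need care, the sketch below assumes the additional metadata is an explicit per-head dimension that may differ from the hidden size divided by the head count. The type, field names, and numbers are hypothetical and are not taken from the commit.

```go
package main

import "fmt"

// attnConfig holds hypothetical attention hyperparameters; the field names
// are illustrative and are not taken from the commit.
type attnConfig struct {
	HiddenSize int
	NumHeads   int
	HeadDim    int // 0 means "not present in the model metadata"
}

// headDim returns the explicit per-head dimension when the metadata
// provides one, and otherwise falls back to HiddenSize / NumHeads.
func headDim(c attnConfig) int {
	if c.HeadDim > 0 {
		return c.HeadDim
	}
	return c.HiddenSize / c.NumHeads
}

// outputProjInput is the width of the concatenated attention heads, that is,
// the input dimension of the output projection.
func outputProjInput(c attnConfig) int {
	return c.NumHeads * headDim(c)
}

func main() {
	llamaLike := attnConfig{HiddenSize: 4096, NumHeads: 32}
	mistralLike := attnConfig{HiddenSize: 5120, NumHeads: 32, HeadDim: 128}
	fmt.Println(outputProjInput(llamaLike))   // equals the hidden size
	fmt.Println(outputProjInput(mistralLike)) // smaller than the hidden size
}
```

With the fallback alone, the second configuration would assume a projection input of 5120 rather than 4096, which is the kind of mismatch the commit message describes.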


11 Mar, 2025 2 commits
- set non-causal attention · 0df18004
  Michael Yang authored Mar 07, 2025
  
- gemma2 impl · 5f74d1fd
  Patrick Devine authored Feb 07, 2025
  
08 Mar, 2025 1 commit
- ollamarunner: Quiet debug logging and panic on unimplemented features · 0daaaef8
  Jesse Gross authored Mar 07, 2025

  Debug logging of every token has previously caused test timeouts
  on slower machines.
07 Mar, 2025 1 commit

model: Don't unconditionally add special tokens · b70fc4d5

Jesse Gross authored Mar 05, 2025

We sometimes tokenize partial strings. For example, with
multimodal inputs, we split the input string around the images
and then tokenize each piece. In these cases, we should only add
the special tokens on the first piece.
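
The behaviour described above can be sketched with a toy tokenizer. Everything here is an assumption made for illustration: `encode`, `encodePieces`, the `[img]` placeholder, and the BOS token ID are hypothetical and do not reflect the actual tokenizer API.

```go
package main

import (
	"fmt"
	"strings"
)

const bosID int32 = 1 // hypothetical beginning-of-sequence token ID

// encode is a toy tokenizer: one token per whitespace-separated word, with
// the ID derived from the word length. addSpecial controls whether the BOS
// token is prepended.
func encode(s string, addSpecial bool) []int32 {
	var ids []int32
	if addSpecial {
		ids = append(ids, bosID)
	}
	for _, w := range strings.Fields(s) {
		ids = append(ids, int32(100+len(w)))
	}
	return ids
}

// encodePieces tokenizes each piece separately and adds special tokens only
// on the first piece.
func encodePieces(pieces []string) [][]int32 {
	out := make([][]int32, 0, len(pieces))
	for i, p := range pieces {
		out = append(out, encode(p, i == 0))
	}
	return out
}

func main() {
	// Split the prompt around image placeholders, then tokenize each piece.
	pieces := strings.Split("describe [img] and compare [img] briefly", "[img]")
	fmt.Println(encodePieces(pieces))
}
```

Only the first slice starts with the BOS ID; adding it unconditionally would place a BOS token after every image.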


27 Feb, 2025 1 commit
- model: add bos token if configured · 53d2990d
  Michael Yang authored Feb 26, 2025
  
14 Feb, 2025 2 commits

vocab: Use int32 for special tokens · 7916f550

Jesse Gross authored Feb 03, 2025

Special tokens are currently read as uint32 from the model metadata.
However, all other parts of the system (including the tokenizer) use
int32 to represent tokens, so it is impossible to represent the high
portion of the unsigned range. For consistency and to avoid casts,
we should just use int32 everywhere.


next ollama runner (#7913) · 58245413

Michael Yang authored Feb 14, 2025

feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It adds three high-level interfaces and several smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`.
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc.) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`.
- `ml.Tensor` defines the interface for a tensor and tensor operations.
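
The relationship between the three interfaces can be sketched roughly as follows. This is a deliberately simplified, hypothetical version: the method sets and the toy implementations (`vec`, `mapBackend`, `biasModel`) are assumptions for illustration and do not match the real signatures in `model/model.go` or `ml/backend.go`.

```go
package main

import "fmt"

// Tensor is a simplified stand-in for a tensor and its operations.
type Tensor interface {
	Add(other Tensor) Tensor
	Floats() []float32
}

// Backend is a simplified stand-in for a tensor library that owns the
// loaded weights and hands them out by name.
type Backend interface {
	Get(name string) Tensor
}

// Model is a simplified stand-in for a model architecture.
type Model interface {
	Forward(input Tensor) Tensor
}

// vec is a toy Tensor backed by a slice.
type vec []float32

func (v vec) Floats() []float32 { return v }

func (v vec) Add(other Tensor) Tensor {
	o := other.Floats()
	out := make(vec, len(v))
	for i := range v {
		out[i] = v[i] + o[i]
	}
	return out
}

// mapBackend is a toy Backend that keeps tensors in a map.
type mapBackend map[string]Tensor

func (b mapBackend) Get(name string) Tensor { return b[name] }

// biasModel is a toy Model whose forward pass adds one stored tensor.
type biasModel struct{ b Backend }

func (m biasModel) Forward(input Tensor) Tensor {
	return input.Add(m.b.Get("bias"))
}

func main() {
	var m Model = biasModel{b: mapBackend{"bias": vec{1, 2}}}
	fmt.Println(m.Forward(vec{10, 20}).Floats())
}
```

The point of the split is that a model only talks to `Tensor` and `Backend`, so the same `Forward` code can run on whichever backend loaded the weights.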

This is the first implementation of the new engine. Follow-up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080), with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
