next ollama runner (#7913)
feat: add new Ollama engine using ggml through cgo
This change introduces a new way to run pretrained models. It adds three high-level interfaces, along with several smaller helper interfaces, to support this.
- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, provided here as examples, implement the model's forward pass in the `Forward` method, which is called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations.
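To illustrate how the three interfaces fit together, here is a minimal sketch in Go. The method names (`Forward`, `Get`, `Add`, `Shape`) and the toy in-memory backend are illustrative assumptions for this sketch, not the actual signatures in `model/model.go` or `ml/backend.go`:

```go
package main

import "fmt"

// Tensor is a sketch of ml.Tensor: a tensor plus operations on it.
// (Hypothetical methods; the real interface defines many more ops.)
type Tensor interface {
	Shape() []int
	Add(other Tensor) Tensor
}

// Backend is a sketch of ml.Backend: it loads pretrained weights onto
// hardware and hands tensors back to the model by name.
type Backend interface {
	Get(name string) Tensor
}

// Model is a sketch of model.Model: Forward runs the model's forward
// pass over the input tokens to produce output logits.
type Model interface {
	Forward(b Backend, inputs []int32) (Tensor, error)
}

// Toy in-memory implementations, just to show the wiring.

type toyTensor struct{ data []float32 }

func (t *toyTensor) Shape() []int { return []int{len(t.data)} }

func (t *toyTensor) Add(o Tensor) Tensor {
	u := o.(*toyTensor)
	out := make([]float32, len(t.data))
	for i := range t.data {
		out[i] = t.data[i] + u.data[i]
	}
	return &toyTensor{data: out}
}

type toyBackend struct{ weights map[string]Tensor }

func (b *toyBackend) Get(name string) Tensor { return b.weights[name] }

type toyModel struct{}

func (m *toyModel) Forward(b Backend, inputs []int32) (Tensor, error) {
	// A real architecture would run embeddings, attention blocks, etc.;
	// here we just combine two named tensors fetched from the backend.
	return b.Get("embed").Add(b.Get("bias")), nil
}

func main() {
	b := &toyBackend{weights: map[string]Tensor{
		"embed": &toyTensor{data: []float32{1, 2}},
		"bias":  &toyTensor{data: []float32{0.5, 0.5}},
	}}
	var m Model = &toyModel{}
	out, err := m.Forward(b, []int32{0})
	if err != nil {
		panic(err)
	}
	fmt.Println(out.Shape()) // [2]
}
```

The key design point is that a `Model` never touches hardware directly: it only asks the `Backend` for tensors and composes `Tensor` operations, so the same architecture code runs unchanged on GPU or CPU.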
This is the first implementation of the new engine. Follow up PRs will implement more features:
- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
Files changed:
- llm/ggml_test.go (deleted)
- ml/backend.go (added)
- ml/backend/backend.go (added)
- ml/backend/ggml/ggml.go (added)
- ml/nn/convolution.go (added)
- ml/nn/embedding.go (added)
- ml/nn/linear.go (added)
- ml/nn/normalization.go (added)
- model/llama/model.go (added)
- model/mllama/model.go (added)
- model/mllama/model_text.go (added)
- model/mllama/model_vision.go (added)
- model/model.go (added)
- model/model_test.go (added)
- model/process_text.go (added)