- 14 Feb, 2025 8 commits
-
-
Jesse Gross authored
-
Jesse Gross authored
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
-
Jesse Gross authored
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
-
Jesse Gross authored
Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.
-
Jesse Gross authored
There are two cases where we may not have an output after computing:
- Prompt processing where the length of the input exceeds the batch size
- Internal memory management operations such as cache defrag and shift
-
Jesse Gross authored
Currently there is a mixture of int and int64 used when dealing with tensor dimensions and shapes, which causes unnecessary conversions - they should all be the same type. In general, most interfaces (such as PyTorch) use int64 for generality but most implementations (such as CUDA) use int32 for performance. There isn't much benefit to us in being more flexible than the implementations we are likely to run on. In addition, as a practical matter, a model with a tensor with a single dimension larger than 32 bits is unlikely to run on a 32-bit machine.
-
Jesse Gross authored
It is not common to return errors from close/free operations - most people won't check them and, even if they did, there's probably not much they can do. It's better to not give implementations false expectations.
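A minimal sketch of the design choice described in this commit message, assuming a hypothetical `Context` interface and `memContext` type (neither name is taken from the actual change): `Close` has no return value, so callers can simply `defer` it and implementations are not expected to report failures.

```go
package main

import "fmt"

// Context is a hypothetical resource handle. Close deliberately returns
// nothing, so callers are not led to expect an actionable error.
type Context interface {
	Close()
}

// memContext is a hypothetical implementation that owns a scratch buffer.
type memContext struct {
	buf    []byte
	closed bool
}

// Close releases the buffer. It is safe to call more than once.
func (c *memContext) Close() {
	c.buf = nil
	c.closed = true
}

// useAndClose does some work with the context and always closes it.
func useAndClose(c *memContext) int {
	var ctx Context = c
	defer ctx.Close() // no error value to check or ignore
	return len(c.buf)
}

func main() {
	c := &memContext{buf: make([]byte, 16)}
	n := useAndClose(c)
	fmt.Println(n, c.closed)
}
```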
-
Michael Yang authored
feat: add new Ollama engine using ggml through cgo

This change introduces a new way to run pretrained models. It introduces 3 high level interfaces and a bunch of smaller helper interfaces to facilitate this.

- `model.Model` defines the interface for a model architecture. Models such as `llama` and `mllama`, which are provided as examples, can implement the model's forward propagation in the `Forward` method. This method will be called to generate completions. This interface can be found in `model/model.go`
- `ml.Backend` defines the interface for a backend tensor library, in this case `ggml`. Among other things, a Backend is responsible for loading a pretrained model into hardware (GPU, CPU, etc) and providing an interface for Models to access loaded tensors. This interface can be found in `ml/backend.go`
- `ml.Tensor` defines the interface for a tensor and tensor operations

This is the first implementation of the new engine. Follow up PRs will implement more features:

- non-greedy sampling (#8410)
- integration with Ollama and KV caching (#8301)
- more model support (#9080) with more coming soon

Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
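The division of responsibilities described in this commit message can be pictured with a toy sketch. Everything below is an assumption made for illustration: the method sets, the `sliceTensor`, `mapBackend`, and `scaleModel` types, and the `"weight"` tensor name are hypothetical and are not the signatures found in `model/model.go` or `ml/backend.go`.

```go
package main

import "fmt"

// Tensor is a hypothetical, heavily simplified tensor interface.
type Tensor interface {
	Shape() []int
	Floats() []float32
}

// Backend is a hypothetical backend that hands out loaded tensors by name.
type Backend interface {
	Get(name string) Tensor
}

// Model is a hypothetical model interface; Forward runs one forward pass.
type Model interface {
	Forward(b Backend, input []float32) ([]float32, error)
}

// sliceTensor keeps its data in an ordinary Go slice.
type sliceTensor struct {
	shape []int
	data  []float32
}

func (t sliceTensor) Shape() []int      { return t.shape }
func (t sliceTensor) Floats() []float32 { return t.data }

// mapBackend stores "loaded" tensors in a map.
type mapBackend map[string]Tensor

func (b mapBackend) Get(name string) Tensor { return b[name] }

// scaleModel multiplies the input elementwise by the "weight" tensor.
type scaleModel struct{}

func (scaleModel) Forward(b Backend, input []float32) ([]float32, error) {
	w := b.Get("weight")
	if w == nil {
		return nil, fmt.Errorf("tensor %q not loaded", "weight")
	}
	weights := w.Floats()
	if len(weights) != len(input) {
		return nil, fmt.Errorf("length mismatch: %d vs %d", len(weights), len(input))
	}
	out := make([]float32, len(input))
	for i, v := range input {
		out[i] = v * weights[i]
	}
	return out, nil
}

func main() {
	var m Model = scaleModel{}
	b := mapBackend{"weight": sliceTensor{shape: []int{2}, data: []float32{2, 3}}}
	out, err := m.Forward(b, []float32{1, 4})
	fmt.Println(out, err)
}
```

The point of the split is that the model code only talks to the `Backend` and `Tensor` interfaces, so the same `Forward` logic could run against a different tensor library.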
-
- 13 Feb, 2025 4 commits
-
-
Bùi Đức Nhật authored
-
frob authored
Co-authored-by: Richard Lyons <frob@cloudstaff.com>
-
Anuraag (Rag) Agrawal authored
-
Jeffrey Morgan authored
-
- 12 Feb, 2025 3 commits
-
-
Clinton authored
-
bloominstrong authored
Remove the channel tag from the URL so it always goes to the current stable channel.
-
Hugues Chocart authored
-
- 11 Feb, 2025 2 commits
-
-
Michael Yang authored
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
-
Hugues Chocart authored
-
- 10 Feb, 2025 2 commits
-
-
Jeffrey Morgan authored
-
Hugues Chocart authored
-
- 08 Feb, 2025 4 commits
-
-
Michael Yang authored
Ollama requires vcruntime140_1.dll, which isn't found on 2019. Previously the job used the windows runner (2019), but it explicitly installs 2022 to build the app. Since the sign job doesn't actually build anything, it can use the windows-2022 runner instead.
-
Qusai Ismael authored
-
DravenK authored
-
Jeffrey Morgan authored
-
- 07 Feb, 2025 6 commits
-
-
Guddu Kumar authored
-
Azis Alvriyanto authored
-
Michael Yang authored
-
Leisure Linux authored
-
annilq authored
-
CosmicEventHorizon authored
-
- 06 Feb, 2025 11 commits
-
-
Michael Yang authored
-
oslook authored
-
Michael Yang authored
-
Abhinav Pant authored
-
Michael Yang authored
-
Azis Alvriyanto authored
-
Michael Yang authored
The find returns intermediate directories, which pulls in the parent directories. It also omits files under lib/ollama. Switch back to globbing.
-
zyphixor authored
-
Diego Pereira authored
Shield the code processing the embedding result from subsequent calls that may overwrite the same buffer to process a second input when retrieving model embeddings.
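The hazard this commit message describes can be sketched as follows, assuming a hypothetical `fakeModel` that reuses one internal buffer for every embedding it computes (the type and its methods are illustrative, not taken from the actual fix). Copying the result before the next call keeps it stable.

```go
package main

import "fmt"

// fakeModel is a hypothetical stand-in for a runtime that writes every
// embedding into the same internal buffer.
type fakeModel struct {
	buf []float32
}

// embed overwrites the shared buffer and returns a slice that aliases it.
func (m *fakeModel) embed(seed float32) []float32 {
	for i := range m.buf {
		m.buf[i] = seed + float32(i)
	}
	return m.buf
}

// embedCopy returns a private copy, so later calls cannot change the result.
func (m *fakeModel) embedCopy(seed float32) []float32 {
	src := m.embed(seed)
	out := make([]float32, len(src))
	copy(out, src)
	return out
}

func main() {
	m := &fakeModel{buf: make([]float32, 3)}
	aliased := m.embed(1)
	safe := m.embedCopy(1)
	m.embed(10) // a second input overwrites the shared buffer
	fmt.Println(aliased, safe) // prints "[10 11 12] [1 2 3]"
}
```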
-
Michael Yang authored
* chore: update gitattributes
* chore: add build info source
-
Daniel Lok authored
-