- 07 Apr, 2025 4 commits
-
-
Alex Rozgo authored
-
Devon Rifkin authored
CONTRIBUTING: fix code block formatting
-
Devon Rifkin authored
There were only 3 spaces instead of 4, so the example was being interpreted as containing HTML elements.
-
Michael Yang authored
-
- 05 Apr, 2025 1 commit
-
-
Daniel Hipke authored
Improves model loading times on network-based filesystems such as GCS FUSE by creating a dedicated file descriptor for each section of the file being read, reducing seeking.
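A minimal Go sketch of the idea, assuming a hypothetical openSection helper (not the actual Ollama loader code): each section of the model file gets its own descriptor, so concurrent readers never share a file offset.

```go
package loader

import (
	"io"
	"os"
)

// openSection opens a dedicated file descriptor for one section of a
// model file. Because each section has its own descriptor, readers of
// different sections never contend for a shared offset, so the
// filesystem sees mostly sequential reads instead of repeated seeking.
// Hypothetical helper for illustration only.
func openSection(path string, offset, size int64) (io.ReadCloser, error) {
	f, err := os.Open(path)
	if err != nil {
		return nil, err
	}
	return struct {
		io.Reader
		io.Closer
	}{io.NewSectionReader(f, offset, size), f}, nil
}
```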
-
- 03 Apr, 2025 4 commits
-
-
Bruce MacDonald authored
Mistral is a popular research lab making open source models. This updates the forward pass of llama architecture models to support both llama models and mistral models by accounting for additional metadata present in mistral models, and finding the correct dimensions for the output projection.
-
Michael Yang authored
fs: move ml.Config to fs package
-
Michael Yang authored
-
Bruce MacDonald authored
No functional change. Many different done reasons can be set at the runner level, so rather than obscuring them we should return them to the server process and let it choose what to do with the done reason. This separates the API concerns from the runner.
-
- 02 Apr, 2025 5 commits
-
-
Jeffrey Morgan authored
-
jmorganca authored
The sliding window cache trims entries that are outside the window for the latest token. This works when we are extending the cache, such as when the conversation continues. However, if we have a partial overlap in conversation (including the BOS tokens), then we resume from a past point in the conversation and the needed tokens are no longer stored in memory. This verifies that the new window overlaps with the old one before reusing the cache.
Co-authored-by: Jesse Gross <jesse@ollama.com>
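A hedged sketch of the overlap check described above; the function and its inputs are assumptions for illustration, not the runner's actual code.

```go
package kvreuse

// canReuseCache reports whether a cached prefix can be reused for a new
// prompt under sliding-window attention. Entries older than the window
// have already been trimmed, so reuse is only safe if the shared prefix
// still reaches tokens that remain in memory.
func canReuseCache(cachedLen, windowSize, commonPrefixLen int) bool {
	// Position of the oldest token still held in the cache.
	oldestCached := cachedLen - windowSize
	if oldestCached < 0 {
		oldestCached = 0
	}
	// If the common prefix ends before the retained window begins, the
	// tokens we would need have been evicted and the cache must be rebuilt.
	return commonPrefixLen > oldestCached
}
```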
-
Jesse Gross authored
When truncating inputs to the context window at the beginning of a sequence, we remove the minimum amount possible. However, this may cause us to truncate to the middle of a set of inputs that the model specified should not be split up. To avoid this, we need to remove the rest of the partial batch.
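One way to picture the fix, under assumed types (the sameBatch group marker and the drop-oldest-first strategy are illustrative, not the runner's actual representation): after truncating, also drop the remainder of any group that was cut in the middle.

```go
package runner

// input is a single model input; sameBatch marks a group of inputs that
// must not be split across batches (illustrative fields only).
type input struct {
	token     int32
	sameBatch int // non-zero group id; members must stay together
}

// truncateToBoundary keeps at most max inputs, dropping the oldest ones
// first. If the cut lands inside a group that must stay together, the
// rest of that partial group is dropped as well.
func truncateToBoundary(inputs []input, max int) []input {
	if len(inputs) <= max {
		return inputs
	}
	start := len(inputs) - max
	for start > 0 && start < len(inputs) &&
		inputs[start].sameBatch != 0 &&
		inputs[start].sameBatch == inputs[start-1].sameBatch {
		start++
	}
	return inputs[start:]
}
```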
-
Bruce MacDonald authored
Both interface{} and any (which is just an alias for interface{} introduced in Go 1.18) represent the empty interface that all types satisfy.
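For reference, a small runnable example of the equivalence (the alias means the change is purely cosmetic):

```go
package main

import "fmt"

// any is a predeclared alias for interface{} since Go 1.18, so the two
// spellings are interchangeable: both accept a value of any type.
func printOld(v interface{}) { fmt.Println(v) }
func printNew(v any)         { fmt.Println(v) }

func main() {
	printOld("hello") // compiles and behaves identically
	printNew(42)
}
```
-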
IsAurora6 authored
-
- 01 Apr, 2025 4 commits
-
-
Bruce MacDonald authored
With support for multimodal models becoming more varied and common, it is important for clients to be able to easily see what capabilities a model has. Returning these from the show endpoint will allow clients to easily see what a model can do.
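A client-side sketch of what this enables; the capabilities field name and its values are assumptions based on this description rather than a documented contract.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

func main() {
	// Ask the show endpoint about a model, then read its capabilities.
	body, _ := json.Marshal(map[string]string{"model": "llava"})
	resp, err := http.Post("http://localhost:11434/api/show", "application/json", bytes.NewReader(body))
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	var out struct {
		Capabilities []string `json:"capabilities"` // assumed field name
	}
	if err := json.NewDecoder(resp.Body).Decode(&out); err != nil {
		panic(err)
	}
	fmt.Println("model capabilities:", out.Capabilities)
}
```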
-
Ilian authored
-
Abyss-c0re authored
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
-
湛露先生 authored
Signed-off-by: zhanluxianshen <zhanluxianshen@163.com>
-
- 31 Mar, 2025 5 commits
-
-
Bruce MacDonald authored
Clear the KV cache when the shift operation is not supported by the model. Added a KvCacheCanShift() check to handle models that can't perform cache shifts, falling back to a full cache clear while preserving the logical token history to maintain expected behavior when the context window fills up.
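A compilable sketch of the fallback path with illustrative types; only the CanShift-style check and the clear-instead-of-shift behavior come from the description above.

```go
package runner

// kvCache is a minimal stand-in for the runner's cache API.
type kvCache interface {
	CanShift() bool
	Shift(discard int)
	Clear()
}

type sequence struct {
	cache   kvCache
	tokens  []int32 // logical token history, preserved across a clear
	numPast int     // tokens currently resident in the KV cache
}

// handleContextFull frees space when the context window fills up: shift
// out the oldest tokens when the model supports it, otherwise clear the
// whole cache while keeping the logical token history so callers still
// see the expected behavior.
func (s *sequence) handleContextFull() {
	if s.cache.CanShift() {
		discard := s.numPast / 2
		s.cache.Shift(discard)
		s.numPast -= discard
		return
	}
	s.cache.Clear()
	s.numPast = 0
}
```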
-
Blake Mizerany authored
This change adds tracking of download chunks during the pull process so that subsequent pulls can skip downloading already completed chunks. This works across restarts of ollama. Currently, download state will be lost if a prune is triggered during a pull (e.g. restart or remove). This issue should be addressed in a follow-up PR.
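An illustrative sketch of chunk bookkeeping that survives restarts; the format and names are made up for the example, and the real implementation differs.

```go
package pull

import (
	"encoding/json"
	"os"
)

// chunkState records which byte ranges of a blob have already been
// downloaded, persisted next to the partial blob so a later pull can
// skip them.
type chunkState struct {
	Completed [][2]int64 `json:"completed"` // [offset, length] pairs
}

func loadState(path string) (chunkState, error) {
	var st chunkState
	b, err := os.ReadFile(path)
	if err != nil {
		if os.IsNotExist(err) {
			return st, nil // first pull: nothing completed yet
		}
		return st, err
	}
	if err := json.Unmarshal(b, &st); err != nil {
		return chunkState{}, err
	}
	return st, nil
}

// markDone appends a finished chunk and persists the state so progress
// is not lost if ollama restarts mid-pull.
func (st *chunkState) markDone(path string, off, n int64) error {
	st.Completed = append(st.Completed, [2]int64{off, n})
	b, err := json.Marshal(st)
	if err != nil {
		return err
	}
	return os.WriteFile(path, b, 0o644)
}
```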
-
Jesse Gross authored
If we have an error after creating a new sequence but before finding a slot for it, we return without releasing the semaphore. This reduces our parallel sequences and eventually leads to deadlock. In practice this should never happen because once we have acquired the semaphore, we should always be able to find a slot. However, the code is clearly not correct.
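The shape of the fix, sketched with golang.org/x/sync/semaphore and illustrative types (the actual runner code differs):

```go
package scheduler

import (
	"context"
	"errors"

	"golang.org/x/sync/semaphore"
)

type server struct {
	seqSem *semaphore.Weighted // limits parallel sequences
}

// addSequence acquires a parallel-sequence slot. If anything fails after
// the acquire but before the sequence is actually placed in a slot, the
// semaphore must be released, or capacity leaks until deadlock.
func (s *server) addSequence(ctx context.Context) error {
	if err := s.seqSem.Acquire(ctx, 1); err != nil {
		return err
	}
	slot, err := s.findSlot()
	if err != nil {
		s.seqSem.Release(1) // the previously missing release on this error path
		return err
	}
	_ = slot // ... run the sequence; release again when it finishes
	return nil
}

func (s *server) findSlot() (int, error) {
	return 0, errors.New("no free slot") // placeholder
}
```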
-
Jesse Gross authored
With the llama runner, we can generate up to NUM_PARALLEL batches at once, which will then get broken up into individual batches to get executed by llama.cpp (i.e. we add up to 2048 tokens and this gets split into 4 batches of 512 tokens at default settings). This splitting can improve parallelism on multi-GPU systems because the individual batches can move through the pipeline without blocking on the first one to fully complete.

However, we don't yet support this in the Ollama runner, partially because it makes it hard to enforce model-specified batch constraints, which didn't exist previously. The result is that we will try to execute the full, unsplit batch. This could result in out of memory or insufficient KV cache space errors.

This triggers batch breaking when the total number of inputs from all sequences exceeds the batch size, rather than per sequence. In order to ensure fairness, it also reintroduces round-robinning around sequences so that we don't let one busy sequence starve the others.
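A sketch of total-based batch breaking with round-robin fairness; the types and the one-token-per-turn granularity are simplifications for illustration, not the runner's real code.

```go
package batching

// seq holds the pending inputs for one request (illustrative).
type seq struct {
	pending []int32
}

// buildBatch fills one batch by round-robinning across sequences and
// stops when the combined number of inputs reaches batchSize, i.e. the
// limit applies to the total, not to each sequence individually. The
// round-robin keeps one busy sequence from starving the others.
func buildBatch(seqs []*seq, batchSize int) []int32 {
	batch := make([]int32, 0, batchSize)
	for progress := true; progress && len(batch) < batchSize; {
		progress = false
		for _, s := range seqs {
			if len(batch) >= batchSize {
				break
			}
			if len(s.pending) == 0 {
				continue
			}
			batch = append(batch, s.pending[0])
			s.pending = s.pending[1:]
			progress = true
		}
	}
	return batch
}
```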
-
Leandro Borges Ferreira authored
-
- 28 Mar, 2025 1 commit
-
-
CYJiang authored
Co-authored-by: Bruce MacDonald <brucewmacdonald@gmail.com>
-
- 27 Mar, 2025 3 commits
-
-
Jesse Gross authored
Model implementations should use Input for all of their tensors supplied to the model. This includes tensors that relate to the outputs, which is confusing since there is also an Output function. Since Output is only used internally in GGML and not used by any model implementations, we can remove it from the interface to reduce confusion.
-
saman-amd authored
-
Parth Sareen authored
-
- 26 Mar, 2025 5 commits
-
-
molbal authored
-
Hengky Steen authored
-
Jesse Gross authored
Gemma3 uses sliding windows for its context on 5/6 layers, significantly reducing memory usage but leading to uneven usage across layers, which makes allocation to the correct GPU difficult. We currently estimate very conservatively by assuming all layers are consistent at the max size.

Llama3.2-vision is also inconsistent between self attention and cross attention layers; at the moment, we calculate the correct total size and then average this across layers. In some cases, this may lead to crashes if a large layer is placed on a GPU sized by the average.

This allows memory estimation to calculate per-layer KV cache size and take this into account when placing layers onto GPUs. We already do this for weights that vary per-tensor, so this is a logical extension.

Fixes #9730
Fixes #9890
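A rough sketch of per-layer sizing under assumed parameters (head dimension, KV heads, element size); the real estimator accounts for more factors such as cache data type and padding.

```go
package estimate

// layerKVSize returns the KV cache bytes for one layer given its
// effective context length: the full context for global-attention
// layers, or the window size for sliding-window layers.
func layerKVSize(ctxLen, headDim, numKVHeads, bytesPerElem int) int {
	// K and V each hold ctxLen * numKVHeads * headDim elements.
	return 2 * ctxLen * numKVHeads * headDim * bytesPerElem
}

// totalKVSize sums per-layer sizes instead of assuming every layer is as
// large as the biggest one, which matters when most layers use a much
// smaller sliding window.
func totalKVSize(ctxLens []int, headDim, numKVHeads, bytesPerElem int) int {
	total := 0
	for _, c := range ctxLens {
		total += layerKVSize(c, headDim, numKVHeads, bytesPerElem)
	}
	return total
}
```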
-
Jesse Gross authored
-
Jesse Gross authored
When computing the size of the cache for sliding window attention, we don't need to multiply the batch size by the number of parallel sequences - the batch size is constant. This also simplifies the check for whether to allocate the cache size based on capacity or window size, as the batch size is already incorporated into the capacity when handled by the runner.
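The arithmetic point, sketched with assumed names (the exact formula in the code may differ, and the per-sequence window term is an assumption): the in-flight batch is added once, not once per parallel sequence.

```go
package kvcache

// swaCacheSize sketches the sizing described above for sliding window
// attention. Illustrative only.
func swaCacheSize(windowSize, batchSize, numParallel int) int {
	// Not (windowSize + batchSize) * numParallel: the batch is constant
	// across sequences, so it is added once.
	return windowSize*numParallel + batchSize
}
```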
-
- 25 Mar, 2025 1 commit
-
-
copeland3300 authored
-
- 24 Mar, 2025 1 commit
-
-
Matheus C. França authored
-
- 21 Mar, 2025 6 commits
-
-
Blake Mizerany authored
Close chunked writers as soon as downloads complete, rather than deferring closure until Pull exits. This prevents exhausting file descriptors when pulling many layers. Instead of unbounded defers, use a WaitGroup and background goroutine to close each chunked writer as soon as its downloads finish. Also rename 'total' to 'received' for clarity.
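The closing pattern, sketched with standard sync primitives; the names and the per-writer WaitGroup layout are illustrative.

```go
package pull

import (
	"io"
	"sync"
)

// closeWhenDone closes a chunked writer in the background as soon as all
// of its chunk downloads have finished, instead of deferring the Close
// until Pull returns. downloads must have had Add called for every chunk
// before this is invoked; closers lets Pull wait for all closes to finish.
func closeWhenDone(w io.Closer, downloads, closers *sync.WaitGroup) {
	closers.Add(1)
	go func() {
		defer closers.Done()
		downloads.Wait() // every chunk for this writer is complete
		w.Close()        // release the file descriptor right away
	}()
}
```

Pull would then call closers.Wait() once before returning, so each descriptor is closed exactly once and never held longer than its downloads require.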
-
Michael Yang authored
ml/backend/ggml: load tensors in 128KiB chunks
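A minimal sketch of chunked copying at the 128KiB granularity named in the title; the real loader writes into backend tensor buffers, not an io.Writer.

```go
package ggml

import "io"

// chunkSize matches the 128KiB granularity from the commit title.
const chunkSize = 128 * 1024

// copyInChunks streams size bytes starting at offset from src to dst in
// fixed-size chunks rather than one large read.
func copyInChunks(dst io.Writer, src io.ReaderAt, offset, size int64) error {
	buf := make([]byte, chunkSize)
	for size > 0 {
		n := int64(len(buf))
		if size < n {
			n = size
		}
		nr, err := src.ReadAt(buf[:n], offset)
		if int64(nr) < n {
			return err // short read: err is non-nil per the io.ReaderAt contract
		}
		if _, err := dst.Write(buf[:nr]); err != nil {
			return err
		}
		offset += n
		size -= n
	}
	return nil
}
```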
-
Michael Yang authored
-
Bruce MacDonald authored
-
Blake Mizerany authored
-
Parth Sareen authored
This reverts commit ffbfe833.
-