- 29 May, 2025 1 commit
- Jesse Gross authored
This enables matching up devices and information reported by the backend with system management libraries such as nvml to get accurate free memory reporting.
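For illustration, a minimal Go sketch of the kind of matching this enables; the backendDevice and nvmlDevice types and the UUID-based matching below are assumptions, not the actual ollama or NVML API:

```go
package main

import "fmt"

// backendDevice is a stand-in for the device info a compute backend reports;
// the UUID field is the key used to line it up with management libraries.
type backendDevice struct {
	Name string
	UUID string
}

// nvmlDevice is a stand-in for what a management library such as NVML reports.
type nvmlDevice struct {
	UUID    string
	FreeMem uint64 // bytes
}

// freeMemoryByUUID matches backend devices to management-library entries by UUID
// and returns the free memory for each backend device that has a match.
func freeMemoryByUUID(backend []backendDevice, nvml []nvmlDevice) map[string]uint64 {
	byUUID := make(map[string]uint64, len(nvml))
	for _, d := range nvml {
		byUUID[d.UUID] = d.FreeMem
	}

	free := make(map[string]uint64, len(backend))
	for _, d := range backend {
		if mem, ok := byUUID[d.UUID]; ok {
			free[d.Name] = mem
		}
	}
	return free
}

func main() {
	backend := []backendDevice{{Name: "CUDA0", UUID: "GPU-1234"}}
	nvml := []nvmlDevice{{UUID: "GPU-1234", FreeMem: 8 << 30}}
	fmt.Println(freeMemoryByUUID(backend, nvml)) // map[CUDA0:8589934592]
}
```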
- 22 May, 2025 1 commit
- Jesse Gross authored
GGML has a function to report the allocated size of a backend buffer. However, this returns 0 if we tried to allocate a buffer and it failed. For memory management purposes, it's important to know how much we were trying to allocate. This extends the API to report attempted sizes for all buffers and whether the allocation succeeded.
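A minimal Go sketch of the bookkeeping this makes possible; the bufferAlloc type and its fields are hypothetical, not the extended GGML API itself:

```go
package main

import "fmt"

// bufferAlloc is a hypothetical record of one backend buffer allocation attempt.
// Size is what we asked for; Allocated reports whether the allocation succeeded.
type bufferAlloc struct {
	Name      string
	Size      uint64 // bytes requested
	Allocated bool
}

// required sums the attempted sizes across all buffers, including failed ones,
// which is the number a memory manager needs when deciding how to re-plan.
func required(allocs []bufferAlloc) (total uint64, failed int) {
	for _, a := range allocs {
		total += a.Size
		if !a.Allocated {
			failed++
		}
	}
	return total, failed
}

func main() {
	allocs := []bufferAlloc{
		{Name: "weights", Size: 4 << 30, Allocated: true},
		{Name: "graph", Size: 1 << 30, Allocated: false}, // failed, but the attempted size is still known
	}
	total, failed := required(allocs)
	fmt.Printf("attempted %d bytes, %d buffer(s) failed\n", total, failed)
}
```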
- 14 May, 2025 2 commits
- Bruce MacDonald authored
- Michael Yang authored
- 13 May, 2025 2 commits
- Jeffrey Morgan authored
- Jeffrey Morgan authored
- 12 May, 2025 1 commit
- Jeffrey Morgan authored
- 06 May, 2025 1 commit
- Daniel Hiltgen authored
* Move quantization logic to GGML via new backend. This moves the model-aware logic to Go code and calls GGML's quantization code for model creation.
* Remove "add model quantizations". This is no longer needed now that quantization is implemented directly in Go+GGML code.
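A rough Go sketch of the model-aware half of this split; the tensor type, naming rules, and target types below are illustrative assumptions, and the actual conversion is performed by GGML's quantization code:

```go
package main

import (
	"fmt"
	"strings"
)

// tensor is a stand-in for a model tensor scheduled for conversion.
type tensor struct {
	Name string
	Kind string // current type, e.g. "F16"
}

// targetType sketches the model-aware part that now lives in Go: deciding which
// quantization type each tensor should get. The rules here are illustrative
// only, not the ones ollama actually applies.
func targetType(t tensor, want string) string {
	switch {
	case strings.HasSuffix(t.Name, "_norm.weight"):
		return "F32" // keep norms in full precision
	case t.Name == "token_embd.weight":
		return "Q8_0" // keep embeddings at higher precision
	default:
		return want // e.g. "Q4_K_M"
	}
}

func main() {
	tensors := []tensor{
		{Name: "blk.0.attn_q.weight", Kind: "F16"},
		{Name: "blk.0.attn_norm.weight", Kind: "F32"},
		{Name: "token_embd.weight", Kind: "F16"},
	}
	for _, t := range tensors {
		// In the real change, the conversion itself is done by GGML's
		// quantization code (called from Go); here we only print the plan.
		fmt.Printf("%s: %s -> %s\n", t.Name, t.Kind, targetType(t, "Q4_K_M"))
	}
}
```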
- 02 May, 2025 2 commits
- Jesse Gross authored
Worst case graph preallocation was disabled by a27462b7 "ollamarunner: Temporarily disable worst case graph preallocation" since it caused crashes with large batches when not using the GPU. This backports upstream llama.cpp commit f057808 "ggml: Don't assert fail when tensor data changes (#13222)", which fixes the underlying bug and allows reverting the previous workaround.
- Jeffrey Morgan authored
- 25 Apr, 2025 1 commit
- Jeffrey Morgan authored
- 24 Apr, 2025 1 commit
- Parth Sareen authored
- 17 Apr, 2025 1 commit
- Jeffrey Morgan authored
- 16 Apr, 2025 1 commit
- Jeffrey Morgan authored
- 15 Apr, 2025 1 commit
- Jesse Gross authored
When ggml_backend_buffer_free() is called, the device memory is released but not all backends consistently release the actual ggml_backend_buffer_t in system RAM, causing a memory leak. Bug #10040
- 03 Apr, 2025 1 commit
- Bruce MacDonald authored
Mistral is a popular research lab making open-source models. This updates the forward pass of llama-architecture models to support both llama and mistral models by accounting for additional metadata present in mistral models and finding the correct dimensions for the output projection.
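A hedged Go sketch of what "finding the correct dimensions" could look like: prefer an explicit metadata key when the model carries one, otherwise fall back to the embedding length. The key names (mistral.output_length in particular) and the fallback rule are hypothetical, for illustration only:

```go
package main

import "fmt"

// kv is a stand-in for model metadata (key/value pairs from the model header).
type kv map[string]any

// outputDim picks the output projection size: use an explicit key if present
// (standing in for the extra metadata some mistral exports carry), otherwise
// fall back to the embedding length. Key names here are illustrative.
func outputDim(meta kv) int {
	if v, ok := meta["mistral.output_length"].(int); ok {
		return v // hypothetical mistral-specific key
	}
	if v, ok := meta["llama.embedding_length"].(int); ok {
		return v
	}
	return 0
}

func main() {
	llama := kv{"llama.embedding_length": 4096}
	mistral := kv{"llama.embedding_length": 5120, "mistral.output_length": 32768}
	fmt.Println(outputDim(llama), outputDim(mistral)) // 4096 32768
}
```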
- 27 Mar, 2025 1 commit
- saman-amd authored
- 15 Mar, 2025 1 commit
- Patrick Devine authored
- 11 Mar, 2025 1 commit
- Michael Yang authored
- 07 Mar, 2025 1 commit
- Jeffrey Morgan authored
- 03 Mar, 2025 1 commit
- Michael Yang authored
Expand backend loading error handling to catch more problems and log them instead of panicking.
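A minimal Go sketch of the pattern described, logging failures and recovering from panics instead of crashing; loadBackend is a placeholder, and the real loader's signature differs:

```go
package main

import (
	"errors"
	"fmt"
	"log/slog"
)

// loadBackend is a placeholder for whatever actually loads a compute backend
// library; in the real code this crosses into C and can fail in many ways.
func loadBackend(path string) error {
	if path == "" {
		panic("empty backend path")
	}
	return errors.New("not a real backend")
}

// tryLoadBackend wraps backend loading so that both returned errors and panics
// are reported to the caller instead of taking the whole process down.
func tryLoadBackend(path string) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("backend %q: %v", path, r)
		}
	}()
	if err := loadBackend(path); err != nil {
		return fmt.Errorf("backend %q: %w", path, err)
	}
	return nil
}

func main() {
	for _, p := range []string{"", "/tmp/libggml-cuda.so"} {
		if err := tryLoadBackend(p); err != nil {
			slog.Warn("skipping backend", "error", err) // log and continue
		}
	}
}
```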
- 28 Feb, 2025 1 commit
- Jeffrey Morgan authored
- 27 Feb, 2025 1 commit
- Jeffrey Morgan authored
- 24 Feb, 2025 1 commit
- Jeffrey Morgan authored
- 20 Feb, 2025 1 commit
- Michael Yang authored
- 19 Feb, 2025 1 commit
- Jeffrey Morgan authored
- 18 Feb, 2025 1 commit
- Michael Yang authored
Sapphire Rapids has AMX support, but it ends up having a negative performance impact. Emerald Rapids also has AMX support with a positive performance impact; however, there is no reasonable way in GGML to differentiate between the two. The impact is small (~6%), so disable AMX entirely for simplicity.
- 14 Feb, 2025 1 commit
- Jeffrey Morgan authored
- 11 Feb, 2025 1 commit
- Michael Yang authored
* wrap ggml_backend_load_best in try/catch
* ignore non-ollama paths
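A small Go sketch of the second point (ignoring non-ollama paths); the lib/ollama marker and the filtering rule are assumptions for illustration, and the try/catch half of the change lives on the C/C++ side:

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// filterOllamaPaths keeps only candidate library paths that live under a
// directory ollama itself ships, so stray system libraries on the search path
// are not handed to the backend loader. The "lib/ollama" marker is an
// assumption for illustration.
func filterOllamaPaths(candidates []string) []string {
	var out []string
	for _, c := range candidates {
		clean := filepath.ToSlash(filepath.Clean(c))
		if strings.Contains(clean, "/lib/ollama/") {
			out = append(out, c)
		}
	}
	return out
}

func main() {
	paths := []string{
		"/usr/local/lib/ollama/cuda_v12/libggml-cuda.so",
		"/usr/lib/x86_64-linux-gnu/libggml.so", // not ours: ignored
	}
	fmt.Println(filterOllamaPaths(paths))
}
```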
- 10 Feb, 2025 1 commit
- Jeffrey Morgan authored
- 05 Feb, 2025 1 commit
- Jeffrey Morgan authored
- 29 Jan, 2025 1 commit
- Michael Yang authored
* add build to .dockerignore
* test: only build one arch
* add build to .gitignore
* fix ccache path
* filter amdgpu targets
* only filter if autodetecting
* Don't clobber gpu list for default runner. This ensures the GPU-specific environment variables are set properly.
* explicitly set CXX compiler for HIP
* Update build_windows.ps1. This isn't complete, but is close. Dependencies are missing, and it only builds the "default" preset.
* build: add ollama subdir
* add .git to .dockerignore
* docs: update development.md
* update build_darwin.sh
* remove unused scripts
* llm: add cwd and build/lib/ollama to library paths
* default DYLD_LIBRARY_PATH to LD_LIBRARY_PATH in runner on macOS
* add additional cmake output vars for msvc
* interim edits to make server detection logic work with dll directories like lib/ollama/cuda_v12
* remove unnecessary filepath.Dir, cleanup
* add hardware-specific directory to path
* use absolute server path
* build: linux arm
* cmake install targets
* remove unused files
* ml: visit each library path once
* build: skip cpu variants on arm
* build: install cpu targets
* build: fix workflow
* shorter names
* fix rocblas install
* docs: clean up development.md
* consistent build dir removal in development.md
* silence -Wimplicit-function-declaration build warnings in ggml-cpu
* update readme
* update development readme
* llm: update library lookup logic now that there is one runner (#8587) (see the library path sketch after this list)
* tweak development.md
* update docs
* add windows cuda/rocm tests

Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Daniel Hiltgen <daniel@ollama.com>
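As referenced in the list above, a sketch of what the library lookup could look like in Go; the directory layout and variant names are assumptions based on the commit messages, not the real implementation:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// libraryPaths builds candidate directories to search for backend libraries:
// next to the executable, the working directory's build output, and
// hardware-specific subdirectories such as cuda_v12. The exact layout mirrors
// the commit messages above but is only a sketch.
func libraryPaths(variants []string) []string {
	var bases []string

	if exe, err := os.Executable(); err == nil {
		exeDir := filepath.Dir(exe)
		bases = append(bases, exeDir, filepath.Join(exeDir, "lib", "ollama"))
	}
	if cwd, err := os.Getwd(); err == nil {
		bases = append(bases, filepath.Join(cwd, "build", "lib", "ollama"))
	}

	paths := append([]string{}, bases...)
	// Add hardware-specific directories (e.g. cuda_v12, rocm) under each base.
	for _, base := range bases {
		for _, v := range variants {
			paths = append(paths, filepath.Join(base, v))
		}
	}
	return paths
}

func main() {
	for _, p := range libraryPaths([]string{"cuda_v12", "rocm"}) {
		fmt.Println(p)
	}
}
```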
- 08 Jan, 2025 1 commit
- Jeffrey Morgan authored
- 17 Dec, 2024 1 commit
- Jesse Gross authored
Sometimes the KV cache requires defragmentation even without triggering the threshold heuristic. In this case, decoding will not be able to find a KV cache slot. This is particularly difficult for the caller to handle if it happens in between ubatches. To avoid this, we should immediately trigger a defrag. In addition, a heavily fragmented cache can require more than max_moves to defragment. Currently, we stop when we hit the limit, but this can leave a cache that still does not have adequate space even after defragmentation is triggered. Instead, we should do multiple batches of processing until everything is complete. Fixes #7949
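A toy Go sketch of the policy described, defragmenting on demand and looping until compaction finishes; the cache type and move accounting are stand-ins, not the runner's actual KV cache:

```go
package main

import "fmt"

// cache is a toy stand-in for a KV cache; Free counts usable slots.
type cache struct {
	Free      int
	Fragments int // fragmented regions that still need to be compacted
}

// defragBatch compacts up to maxMoves fragments and reports how many it moved.
// In the real code each move shifts KV cells; here we only count.
func defragBatch(c *cache, maxMoves int) int {
	moves := c.Fragments
	if moves > maxMoves {
		moves = maxMoves
	}
	c.Fragments -= moves
	c.Free += moves
	return moves
}

// ensureSlot applies the policy above: if no slot is available, trigger a
// defrag immediately (not only when the threshold heuristic fires), and keep
// running defrag batches until the cache is fully compacted rather than
// stopping after a single pass of maxMoves.
func ensureSlot(c *cache, maxMoves int) bool {
	if c.Free > 0 {
		return true
	}
	for c.Fragments > 0 {
		if defragBatch(c, maxMoves) == 0 {
			break
		}
	}
	return c.Free > 0
}

func main() {
	c := &cache{Free: 0, Fragments: 25}
	fmt.Println(ensureSlot(c, 10), c.Free) // true 25 (after 3 defrag batches)
}
```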
- 14 Dec, 2024 1 commit
- Jeffrey Morgan authored
- 12 Dec, 2024 1 commit
- Parth Sareen authored
- 11 Dec, 2024 1 commit
- Jeffrey Morgan authored
- 30 Oct, 2024 1 commit
- Jesse Gross authored
- Update mllama to take the cross attention state as embeddings in a batch, more similar to how Llava handles it. This improves integration with the input cache.
- Pass locations in a prompt for embeddings using tags similar to Llava.
- Abstract the interface to vision models so the main runner accesses Clip and Mllama similarly (see the sketch after this list).

Co-authored-by: Michael Yang <mxyng@pm.me>
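A minimal Go sketch of the abstraction in the last bullet; the interface name, method, and types are assumptions, not the runner's real API:

```go
package main

import "fmt"

// VisionModel sketches the abstraction described above: the runner talks to
// any image encoder through one interface instead of special-casing Clip and
// Mllama. The method name and types are illustrative only.
type VisionModel interface {
	// EncodeImage turns raw image bytes into embeddings that get placed into
	// the batch at the position of the image tag in the prompt.
	EncodeImage(data []byte) ([][]float32, error)
}

type clipModel struct{}

func (clipModel) EncodeImage(data []byte) ([][]float32, error) {
	return [][]float32{{0.1, 0.2}}, nil // placeholder embeddings
}

type mllamaModel struct{}

func (mllamaModel) EncodeImage(data []byte) ([][]float32, error) {
	// For mllama the result is the cross attention state, but the runner
	// treats it the same way: embeddings inserted into the batch.
	return [][]float32{{0.3, 0.4}}, nil
}

func main() {
	for name, m := range map[string]VisionModel{"clip": clipModel{}, "mllama": mllamaModel{}} {
		emb, _ := m.EncodeImage([]byte("fake image"))
		fmt.Println(name, len(emb), "embedding(s)")
	}
}
```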
- 26 Oct, 2024 1 commit
- Daniel Hiltgen authored
On Windows, compiled with gcc, the C++ regex library failed to handle the characters.
- 18 Oct, 2024 1 commit
- Patrick Devine authored
Co-authored-by: jmorganca <jmorganca@gmail.com>
Co-authored-by: Michael Yang <mxyng@pm.me>
Co-authored-by: Jesse Gross <jesse@ollama.com>