- 20 Feb, 2025 9 commits
-
-
Jesse Gross authored
Currently Rows is called as the last step in a model computation to get the values for the output tokens. However, if we move it earlier in the process then we can trim out computations that never get used. This is similar to how models are defined in llama.cpp. Changing the model definition in this way improves token generation performance by approximately 8%.
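A rough sketch of the idea (names like hiddenStates, outputIndices, and Output are illustrative, not the actual model code):

```go
// Before: project every position to vocab space, then pick the output rows.
//   logits := m.Output.Forward(ctx, hiddenStates) // [vocab, seqLen]
//   logits = logits.Rows(ctx, outputIndices)      // [vocab, nOutputs]

// After: select rows first so the expensive output projection (and any
// later ops) only runs for the tokens whose values we actually return.
hiddenStates = hiddenStates.Rows(ctx, outputIndices) // [embed, nOutputs]
logits := m.Output.Forward(ctx, hiddenStates)        // [vocab, nOutputs]
```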
-
Jesse Gross authored
We don't need to create and destroy the GGML scheduler for every context. This introduces extra CPU overhead for every forward pass and extra memory for contexts that don't actually get scheduled (for example, KV caches). We can instead just have one scheduler for the backend and reset it each time we call Compute. This improves token generation performance by 1-2% and removes scheduler create/destroy from profile traces.
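A minimal sketch of the pattern, with hypothetical Go types standing in for the cgo scheduler bindings:

```go
// Graph stands in for a compiled compute graph.
type Graph struct{}

// scheduler stands in for the GGML scheduler bindings.
type scheduler interface {
	Reset()           // drop the previous graph's allocations for reuse
	Compute(g *Graph) // schedule and execute the current graph
}

// Backend owns a single scheduler, created once rather than per Context.
type Backend struct {
	sched scheduler
}

func (b *Backend) Compute(g *Graph) {
	b.sched.Reset() // far cheaper than scheduler create/destroy per pass
	b.sched.Compute(g)
}
```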
-
Jesse Gross authored
Currently the following parameters are in the runner but not used:

- numGPULayers
- mainGPU
- threads
- tensorSplit

This passes them through to the backend, which is where they would actually get used. However, the GGML backend does not yet do anything with them.
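The pass-through might look like a parameter struct handed to the backend constructor (field names and comments assumed for illustration):

```go
// BackendParams carries runner flags through to the backend; the GGML
// backend currently accepts but ignores them.
type BackendParams struct {
	NumThreads   int       // CPU threads to use for computation
	MainGPU      int       // index of the GPU used for small tensors
	NumGPULayers int       // number of layers to offload to the GPU
	TensorSplit  []float32 // fraction of the model to place on each GPU
}
```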
-
Bruce MacDonald authored
Added unit tests to verify error handling behavior in the Client.stream and Client.do methods. Tests cover various error scenarios including:

- Error responses with status codes >= 400
- Error messages with successful status codes
- Empty error messages
- Successful responses
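A sketch of one such test using net/http/httptest (the NewClient and stream signatures, and the request path, are assumed for illustration):

```go
package api

import (
	"context"
	"net/http"
	"net/http/httptest"
	"net/url"
	"strings"
	"testing"
)

func TestStreamErrorStatus(t *testing.T) {
	// Server that answers with a >= 400 status and a JSON error body.
	ts := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusNotFound)
		w.Write([]byte(`{"error":"model not found"}`))
	}))
	defer ts.Close()

	base, _ := url.Parse(ts.URL)
	client := NewClient(base, http.DefaultClient)

	err := client.stream(context.Background(), http.MethodPost, "/api/generate", nil,
		func(bts []byte) error { return nil })
	if err == nil || !strings.Contains(err.Error(), "model not found") {
		t.Fatalf("expected %q error, got %v", "model not found", err)
	}
}
```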
-
Michael Yang authored
clang-built outputs are faster. we were previously building with clang via a gcc wrapper in cgo, but this was missed during the build updates, so there was a drop in performance
-
frob authored
-
danielekp authored
-
Lucas Hahn authored
-
Michael Yang authored
-
- 19 Feb, 2025 5 commits
-
-
Michael Yang authored
build: remove backend build for sapphirerapids
-
yuiseki authored
-
zyxucp authored
-
maninhill authored
-
Jeffrey Morgan authored
-
- 18 Feb, 2025 9 commits
-
-
Michael Yang authored
cmd: fix flickering in progress bar
-
Jeremy Schlatter authored
-
Michael Yang authored
sapphire rapids has amx support, but it ends up having a negative performance impact. emerald rapids also has amx support, with a positive performance impact; however, there's no reasonable way in ggml to differentiate between the two. the impact is small (~6%), so disable amx entirely for simplicity
-
Michael Yang authored
-
Michael Yang authored
set owner and group when building the linux tarball so extracted files are consistent. this is the behaviour of release tarballs in version 0.5.7 and lower
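The release build likely sets this through tar's flags; an equivalent in Go's archive/tar, for illustration only:

```go
import "archive/tar"

// writeEntry normalizes ownership on each header so extracted files look
// the same for everyone (the effect of tar's --owner=0 --group=0).
func writeEntry(tw *tar.Writer, hdr *tar.Header, body []byte) error {
	hdr.Uid, hdr.Gid = 0, 0
	hdr.Uname, hdr.Gname = "root", "root"
	if err := tw.WriteHeader(hdr); err != nil {
		return err
	}
	_, err := tw.Write(body)
	return err
}
```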
-
benhaotang authored
-
L. Jiang authored
-
innightwolfsleep authored
-
Jeremy Schlatter authored
-
- 17 Feb, 2025 2 commits
-
-
Jeremy Schlatter authored
The previous commit fixed flickering in the progress bar itself. Cursor flickering is harder to address: it could be eliminated by hiding the cursor altogether while the progress bar is displayed, but the downside is that if the program is killed in a way that prevents it from cleaning up its state, the cursor would be left invisible. Instead, this commit introduces an output buffer. All of the escape codes and content for a single progress update are written to a buffer, which is then flushed to the terminal all at once. This significantly decreases the time during which the terminal has seen the cursor-hiding code but has not yet seen the cursor-showing code, thus minimizing (but not completely eliminating) cursor flickering. For more context, see: https://gitlab.gnome.org/GNOME/vte/-/issues/2837#note_2269501
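A sketch of the buffering approach (renderProgress and state are hypothetical stand-ins for the actual frame renderer):

```go
// Build the entire frame, escape codes included, then write it in one call
// so the window between cursor-hide and cursor-show is as small as possible.
var buf bytes.Buffer
buf.WriteString("\x1b[?25l")           // hide cursor
buf.WriteString(renderProgress(state)) // cursor movement + bar content
buf.WriteString("\x1b[?25h")           // show cursor
os.Stdout.Write(buf.Bytes())           // single flush per update
```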
-
Jeremy Schlatter authored
Previous code cleared the display before writing new content, creating a window where the terminal could (and in some cases did) render empty lines. Instead, we now write new content over the old content, only clearing the trailing end of lines for cases where the new line is shorter. Fixes #1664
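In terminal-escape terms, roughly (out and newLine are illustrative):

```go
// Return to column 0, overwrite with the new content, then erase only the
// tail of the old line if the new one is shorter ("\x1b[K" = erase to EOL).
fmt.Fprintf(out, "\r%s\x1b[K", newLine)
```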
-
- 15 Feb, 2025 2 commits
-
-
James-William-Kincaid-III authored
-
Bruce MacDonald authored
-
- 14 Feb, 2025 13 commits
-
-
Daniel Hiltgen authored
-
Jesse Gross authored
We currently print system info before the GGML backends are loaded. This results in only getting information about the default lowest common denominator runner. If we move up the GGML init then we can see what we are actually running.

Before:
time=2025-02-14T11:15:07.606-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | cgo(gcc)" threads=24

After:
time=2025-02-14T11:16:02.936-08:00 level=INFO source=runner.go:935 msg=system info="CPU : LLAMAFILE = 1 | CPU : LLAMAFILE = 1 | CUDA : ARCHS = 890 | USE_GRAPHS = 1 | PEER_MAX_BATCH_SIZE = 128 | CPU : SSE3 = 1 | SSSE3 = 1 | AVX = 1 | AVX2 = 1 | F16C = 1 | FMA = 1 | AVX512 = 1 | AVX512_VBMI = 1 | AVX512_VNNI = 1 | LLAMAFILE = 1 | cgo(gcc)" threads=24
-
Jeffrey Morgan authored
provides a better approach to #9088 that will attempt to evaluate symlinks (important for macOS, where 'ollama' is often a symlink), but use the result of os.Executable() as a fallback in scenarios where filepath.EvalSymlinks fails due to permission errors or other issues
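A sketch of the fallback logic described here (the function name is chosen for illustration):

```go
func executablePath() (string, error) {
	exe, err := os.Executable()
	if err != nil {
		return "", err
	}
	// Prefer the symlink-resolved path (on macOS 'ollama' is often a
	// symlink), but keep the raw os.Executable() result when
	// filepath.EvalSymlinks fails, e.g. on permission errors along the path.
	if resolved, err := filepath.EvalSymlinks(exe); err == nil {
		exe = resolved
	}
	return exe, nil
}
```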
-
Jeffrey Morgan authored
In some cases, the directories in the executable path read by filepath.EvalSymlinks are not accessible, resulting in permission errors that prevent models from running. It also doesn't work well with long paths on Windows, which likewise causes errors. This change removes filepath.EvalSymlinks when accessing os.Executable() altogether.
-
Jeffrey Morgan authored
-
Jesse Gross authored
This provides integration with the new Ollama engine (58245413 next ollama runner (#7913)) and the rest of the Ollama infrastructure such as the runner and Ollama server. In addition, it also builds out the KV cache infrastructure to support requirements of how Ollama runs models such as:

- Parallel processing
- Memory management for defragmentation and shifting
- Multi-modal models

Both old and new engines continue to be supported. By default, only the old engine is used. To enable the new engine:

- Start the server with the OLLAMA_NEW_ENGINE environment variable set: OLLAMA_NEW_ENGINE=1 ./ollama serve
- Start a model that is supported by the Ollama engine. This one is Llama 3.1 8b Q4_K_M: ./ollama run jessegross/llama3.1
-
Jesse Gross authored
This allows the list of supported models to live in its own file rather than being mixed into the runner code.
-
Jesse Gross authored
Special tokens are currently read as uint32 from the model metadata. However, all other parts of the system (including the tokenizer) use int32 to represent tokens, so it is impossible to represent the high portion of the unsigned range anyway. For consistency and to avoid casts, we should just use int32 everywhere.
-
Jesse Gross authored
Currently, if a model uses an interface for its data structures (as mllama does) then the tensor data in the structs implementing that interface will not get loaded.
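The fix presumably needs the reflective struct walk to unwrap interface values before recursing; roughly:

```go
v := reflect.ValueOf(field)
// An interface-typed field hides the implementing struct one level down;
// without this unwrap, the tensors inside it are never visited.
if v.Kind() == reflect.Interface && !v.IsNil() {
	v = v.Elem() // continue the walk on the concrete implementation
}
```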
-
Jesse Gross authored
-
Jesse Gross authored
We need to sync before retrieving data after async computation. It is also important to ensure that the Go buffer is not moved by the GC across function calls so we do a synchronous copy.
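A sketch of the retrieval path (Sync, NumElements, and CopyTo are illustrative names, not the actual bindings):

```go
backend.Sync() // wait for the async graph execution to finish first
out := make([]float32, t.NumElements())
// Synchronous copy inside a single call: cgo keeps the destination buffer
// in place for the duration, and afterwards the GC may move it freely
// because no C code holds a pointer to it.
t.CopyTo(out)
```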
-
Jesse Gross authored
Passing in a Go buffer is not safe because the garbage collector could free or move the memory while the context is still open. However, if we pass in the size and a nil pointer then GGML will allocate it from the C side.
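Roughly the difference (newTensor and SetData are illustrative names):

```go
// Unsafe: GGML would retain a pointer into GC-managed memory, which the
// garbage collector may move or free while the context is still open.
//   t := newTensor(ctx, len(data), unsafe.Pointer(&data[0]))

// Safe: pass the size with a nil pointer so GGML allocates on the C side,
// then copy the Go data into it explicitly.
t := newTensor(ctx, len(data), nil)
t.SetData(data)
```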
-
Jesse Gross authored
Most tensor backends try to optimize performance by using a lower precision for matmuls. However, some operations (such as kq) on some models are sensitive to this and require full precision.
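A sketch of how a model might opt a sensitive matmul into full precision (SetPrec and PrecisionF32 are illustrative wrappers; GGML itself exposes this via ggml_mul_mat_set_prec):

```go
// Attention scores: some models produce bad output if this matmul runs at
// the backend's default reduced precision, so force f32 for this op only.
kq := key.Mulmat(ctx, query)
kq.SetPrec(PrecisionF32)
```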
-