- 25 Jun, 2024 2 commits
-
-
Blake Mizerany authored
Previously, some costly operations made loading GGUF files and their metadata and tensor information very slow:

* Too many allocations when decoding strings
* Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and a large amount of disk I/O

The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3.

This commit also allows skipping the collection of large arrays of values when decoding GGUFs, if desired. When such keys are encountered, their values are null and are encoded as such in JSON.

Also, this fixes a broken test that was not encoding valid GGUF.
Blake Mizerany authored
This commit changes the 'ollama run' command to defer fetching model information until it actually needs it, that is, when in interactive mode. It also removes one case where the model information was fetched in duplicate: just before calling generateInteractive, and then again, first thing, inside generateInteractive.

This positively impacts the performance of the command:

    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
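The deferred-fetch pattern can be sketched with `sync.Once`: the expensive call runs at most once, and only on the first code path that actually needs the result, so non-interactive runs never pay for it. A minimal illustration with hypothetical names, not the actual cmd code.

```go
package main

import (
	"fmt"
	"sync"
)

// modelInfo stands in for the data returned by the show API.
type modelInfo struct{ Name string }

// lazyInfo fetches model information at most once, and only when
// first asked for it, so a non-interactive run never triggers fetch.
type lazyInfo struct {
	once  sync.Once
	fetch func() modelInfo
	info  modelInfo
}

func (l *lazyInfo) get() modelInfo {
	l.once.Do(func() { l.info = l.fetch() })
	return l.info
}

func main() {
	calls := 0
	l := &lazyInfo{fetch: func() modelInfo {
		calls++ // the expensive network call in the real command
		return modelInfo{Name: "llama3"}
	}}

	// Interactive path: asks twice, but fetch still runs only once,
	// which also fixes the duplicate-fetch case described above.
	_ = l.get()
	_ = l.get()
	fmt.Println(l.get().Name, calls) // llama3 1
}
```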
-
- 21 Jun, 2024 5 commits
-
-
Daniel Hiltgen authored
Fix use_mmap parsing for modelfiles
-
royjhan authored
-
Michael Yang authored
fix: quantization with template
-
Michael Yang authored
-
Daniel Hiltgen authored
Add the new tristate parsing logic for the code path for modelfiles, as well as a unit test.
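A tri-state boolean is commonly represented in Go as a `*bool`: nil for unspecified, a pointer to true or false otherwise. A minimal sketch of what such parsing might look like (`parseTriState` is a hypothetical name, not ollama's actual function):

```go
package main

import (
	"fmt"
	"strconv"
)

// parseTriState distinguishes "unset" from an explicit true/false by
// returning a *bool: nil means the user never specified the option,
// so the caller is free to apply its own default.
func parseTriState(s string) (*bool, error) {
	if s == "" {
		return nil, nil // unspecified
	}
	v, err := strconv.ParseBool(s)
	if err != nil {
		return nil, fmt.Errorf("invalid boolean %q: %w", s, err)
	}
	return &v, nil
}

func main() {
	for _, s := range []string{"", "true", "false"} {
		v, err := parseTriState(s)
		switch {
		case err != nil:
			fmt.Println("error:", err)
		case v == nil:
			fmt.Println("unspecified")
		default:
			fmt.Println(*v)
		}
	}
}
```

A unit test for this shape checks all three states plus the error case, since the whole point is that nil and false are no longer conflated.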
-
- 20 Jun, 2024 9 commits
-
-
Daniel Hiltgen authored
Refine mmap default logic on linux
-
Daniel Hiltgen authored
Bump latest fedora cuda repo to 39
-
Daniel Hiltgen authored
If we try to use mmap when the model is larger than the system's free memory, loading is slower than the no-mmap approach.
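The default logic described above amounts to a comparison between model size and free memory, with an explicit user choice taking precedence. A hedged sketch; `shouldUseMmap` and its inputs are illustrative, not ollama's actual heuristic:

```go
package main

import "fmt"

// shouldUseMmap honors an explicit user choice (nil means
// unspecified), otherwise avoids mmap when the model would not fit
// in free memory, since faulting pages in on demand is slower than a
// straight read in that case.
func shouldUseMmap(userChoice *bool, modelSize, freeMemory uint64) bool {
	if userChoice != nil {
		return *userChoice
	}
	return modelSize <= freeMemory
}

func main() {
	const gib = 1 << 30
	fmt.Println(shouldUseMmap(nil, 4*gib, 16*gib))  // true: fits in free memory
	fmt.Println(shouldUseMmap(nil, 40*gib, 16*gib)) // false: would thrash
	force := true
	fmt.Println(shouldUseMmap(&force, 40*gib, 16*gib)) // true: user override
}
```

Note how this depends on the tri-state use_mmap value: only a nil (unspecified) choice falls through to the free-memory check.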
-
Michael Yang authored
handle asymmetric embedding KVs
-
Josh authored
fix: skip os.removeAll() if PID does not exist
-
Michael Yang authored
-
Josh Yan authored
-
Josh Yan authored
-
Josh Yan authored
-
- 19 Jun, 2024 15 commits
-
-
royjhan authored
* API Show Extended
* Initial Draft of Information

Co-Authored-By: Patrick Devine <pdevine@sonic.net>

* Clean Up
* Descriptive arg error messages and other fixes
* Second Draft of Show with Projectors Included
* Remove Chat Template
* Touches
* Prevent wrapping from files
* Verbose functionality
* Docs
* Address Feedback
* Lint
* Resolve Conflicts
* Function Name
* Tests for api/show model info
* Show Test File
* Add Projector Test
* Clean routes
* Projector Check
* Move Show Test
* Touches
* Doc update

---------

Co-authored-by: Patrick Devine <pdevine@sonic.net>
-
Daniel Hiltgen authored
Implement log rotation for tray app
-
Daniel Hiltgen authored
-
Michael Yang authored
remove confusing log message
-
Michael Yang authored
-
Daniel Hiltgen authored
Move libraries out of users path
-
Daniel Hiltgen authored
Put back temporary intel GPU env var
-
Daniel Hiltgen authored
Fix bad symbol load detection
-
Daniel Hiltgen authored
This reverts commit 755b4e4f.
-
Daniel Hiltgen authored
Pointer dereferences weren't correct in a few libraries, which explains some crashes on older systems or with miswired symlinks for discovery libraries.
-
Daniel Hiltgen authored
Fix levelzero empty symbol detect
-
Blake Mizerany authored
The Digest type in its current form is awkward to work with and presents challenges with regard to how it serializes via String using the '-' prefix. We currently only use this in ollama.com, so we'll move our specific needs around digest parsing and validation there.
-
Wang,Zhe authored
-
Daniel Hiltgen authored
-
- 18 Jun, 2024 7 commits
-
-
Michael Yang authored
deepseek v2 graph
-
Michael Yang authored
-
Daniel Hiltgen authored
Handle models with divergent layer sizes
-
Daniel Hiltgen authored
The recent refactoring of the memory prediction assumed all layers are the same size, but for some models (like deepseek-coder-v2) this is not the case, so our predictions were significantly off.
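The fix implies summing actual per-layer sizes rather than multiplying one layer's size by the layer count. An illustrative sketch under that assumption; the function names are hypothetical, not ollama's:

```go
package main

import "fmt"

// estimateUniform assumes every layer is the size of the first one,
// which is the buggy assumption described above: fine for uniform
// models, significantly off for models with divergent layer sizes.
func estimateUniform(layers []uint64) uint64 {
	if len(layers) == 0 {
		return 0
	}
	return layers[0] * uint64(len(layers))
}

// estimateExact sums the real size of each layer.
func estimateExact(layers []uint64) uint64 {
	var total uint64
	for _, s := range layers {
		total += s
	}
	return total
}

func main() {
	// Divergent layer sizes, as in a model like deepseek-coder-v2.
	layers := []uint64{100, 100, 400, 400, 400}
	fmt.Println(estimateUniform(layers)) // 500: significantly off
	fmt.Println(estimateExact(layers))   // 1400: actual footprint
}
```

An underestimate here is what makes the scheduler offload more layers than actually fit, so the exact sum matters for the GPU memory prediction.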
-
Daniel Hiltgen authored
Tighten up memory prediction logging
-
Daniel Hiltgen authored
Prior to this change, we logged the memory prediction multiple times as the scheduler iterated to find a suitable configuration, which could be confusing since only the last log before the server starts is actually valid. This now logs once, just before starting the server, on the final configuration. It also reports which library is being used instead of always saying "offloading to gpu" when running on CPU.
-
Daniel Hiltgen authored
Adjust mmap logic for cuda windows for faster model load
-
- 17 Jun, 2024 2 commits
-
-
Daniel Hiltgen authored
On Windows, recent llama.cpp changes make mmap slower in most cases, so default it to off. This also implements a tri-state for use_mmap so we can distinguish a user-provided value of true/false from unspecified.
-
Jeffrey Morgan authored
-