- 01 Jul, 2024 9 commits

  - Michael Yang authored
  - Michael Yang authored
  - Daniel Hiltgen authored
    Fix case for NumCtx
  - Daniel Hiltgen authored
    Document concurrent behavior and settings
  - Daniel Hiltgen authored
    This may confuse users into thinking "auto" is an acceptable string; the value must be numeric.
  - Daniel Hiltgen authored
  - Daniel Hiltgen authored
    Enable concurrency by default
  - RAPID ARCHITECT authored
    * Update README.md: add Mesop example to web & desktop
    * Update README.md
    Co-authored-by: Jeffrey Morgan <jmorganca@gmail.com>
  - Eduard authored
    Runs fine on an NVIDIA GeForce GTX 1050 Ti

- 29 Jun, 2024 2 commits

  - Jeffrey Morgan authored
  - Jeffrey Morgan authored
    * Do not shift context for sliding window models
    * Truncate the prompt when it exceeds 2/3 of the context tokens
    * Only target gemma2
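
The truncation rule lends itself to a short sketch. Below is a minimal Go illustration of the "keep the most recent 2/3 of the context" idea from the bullets above; the names (truncatePrompt, numCtx) are hypothetical stand-ins, not ollama's actual runner code:

```go
package main

import "fmt"

// truncatePrompt keeps only the most recent tokens when a sliding-window
// model cannot shift its context. The 2/3 threshold mirrors the commit
// message; the function and its names are hypothetical.
func truncatePrompt(tokens []int, numCtx int) []int {
	limit := numCtx * 2 / 3
	if len(tokens) <= limit {
		return tokens
	}
	// Drop the oldest tokens and keep the tail that still fits.
	return tokens[len(tokens)-limit:]
}

func main() {
	tokens := make([]int, 10)
	for i := range tokens {
		tokens[i] = i
	}
	fmt.Println(truncatePrompt(tokens, 9)) // [4 5 6 7 8 9]
}
```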

- 28 Jun, 2024 4 commits

  - Daniel Hiltgen authored
  - royjhan authored
  - royjhan authored
    * Check projtype exists
    * Maintain ordering
  - royjhan authored

- 27 Jun, 2024 5 commits

  - Michael Yang authored
    gemma2 graph
  - Michael Yang authored
  - Michael authored
    * update readme for gemma 2
  - Michael Yang authored
  - Jeffrey Morgan authored

- 25 Jun, 2024 2 commits

  - Blake Mizerany authored
    Previously, loading GGUF files, their metadata, and their tensor information was VERY slow for two costly reasons:
    * Too many allocations when decoding strings
    * Hitting disk for each read of each key and value, resulting in an excessive number of syscalls and disk I/O
    The show API is now down to 33ms from 800ms+ for llama3 on a MacBook Pro M3. This commit also makes it possible to skip collecting large arrays of values when decoding GGUFs; when such keys are encountered, their values are null and are encoded as such in JSON. Also, this fixes a broken test that was not encoding valid GGUF.
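
Both costs have standard Go remedies: wrap the file in a buffered reader so each small key/value read no longer costs a syscall, and reuse one scratch buffer while decoding strings. A rough sketch of that pattern under those assumptions, not the actual GGUF decoder:

```go
package main

import (
	"bufio"
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// readString decodes a length-prefixed, GGUF-style string. The bufio.Reader
// means each small read hits an in-memory buffer instead of issuing its own
// syscall, and the scratch buffer is reused across calls so only the final
// []byte-to-string conversion allocates.
func readString(r *bufio.Reader, scratch []byte) (string, []byte, error) {
	var n uint64
	if err := binary.Read(r, binary.LittleEndian, &n); err != nil {
		return "", scratch, err
	}
	if uint64(cap(scratch)) < n {
		scratch = make([]byte, n)
	}
	scratch = scratch[:n]
	if _, err := io.ReadFull(r, scratch); err != nil {
		return "", scratch, err
	}
	return string(scratch), scratch, nil
}

func main() {
	// Demo: one length-prefixed string held in memory.
	data := append(binary.LittleEndian.AppendUint64(nil, 5), "hello"...)
	s, _, err := readString(bufio.NewReader(bytes.NewReader(data)), nil)
	fmt.Println(s, err) // hello <nil>
}
```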

  - Blake Mizerany authored
    This commit changes the 'ollama run' command to defer fetching model information until it is actually needed, that is, in interactive mode. It also removes one case where the model information was fetched twice: just before calling generateInteractive, and then again, first thing, inside generateInteractive. This positively impacts the performance of the command:

    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.168 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.220 total
    ; time ./before run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./before run llama3 'hi'  0.02s user 0.01s system 2% cpu 1.217 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 4% cpu 0.652 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 5% cpu 0.498 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with or would you like to chat?
    ./after run llama3 'hi'  0.01s user 0.01s system 3% cpu 0.479 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
    ; time ./after run llama3 'hi'
    Hi! It's nice to meet you. Is there something I can help you with, or would you like to chat?
    ./after run llama3 'hi'  0.02s user 0.01s system 5% cpu 0.507 total
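
The deferral itself is a small pattern. A minimal sketch with a hypothetical lazyInfo type (not the actual cmd code), using sync.Once so interactive mode pays for the fetch exactly once and non-interactive runs never pay it:

```go
package main

import (
	"fmt"
	"sync"
)

// lazyInfo defers a fetch until first use and caches the result,
// so repeated callers share one round trip.
type lazyInfo struct {
	once sync.Once
	info string
	err  error
}

func (l *lazyInfo) get(fetch func() (string, error)) (string, error) {
	l.once.Do(func() { l.info, l.err = fetch() })
	return l.info, l.err
}

func main() {
	var l lazyInfo
	fetch := func() (string, error) {
		fmt.Println("fetching model info...") // printed exactly once
		return "llama3", nil
	}
	for i := 0; i < 3; i++ {
		info, _ := l.get(fetch)
		fmt.Println(info)
	}
}
```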

- 21 Jun, 2024 8 commits

  - Daniel Hiltgen authored
    Fix use_mmap parsing for modelfiles
  - Daniel Hiltgen authored
    Provide consistent ordering for the ps command: longest duration listed first
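
One plausible way to get that ordering in Go is a descending sort on each loaded model's expiry time; the record type here is a hypothetical stand-in for whatever ollama ps actually returns:

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// runningModel is a hypothetical record; ollama's real type differs.
type runningModel struct {
	Name      string
	ExpiresAt time.Time
}

func main() {
	now := time.Now()
	models := []runningModel{
		{"gemma2", now.Add(2 * time.Minute)},
		{"llama3", now.Add(5 * time.Minute)},
	}
	// Longest remaining duration first, so output order is stable.
	sort.Slice(models, func(i, j int) bool {
		return models[i].ExpiresAt.After(models[j].ExpiresAt)
	})
	for _, m := range models {
		fmt.Println(m.Name, time.Until(m.ExpiresAt).Round(time.Second))
	}
}
```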

  - royjhan authored
  - Daniel Hiltgen authored
    Until ROCm v6.2 ships, we won't be able to get accurate free memory reporting on Windows, which makes automatic concurrency too risky. Users can still opt in, but they will need to pay attention to model sizes; otherwise they may thrash/page VRAM or cause OOM crashes. All other platforms and GPUs now have accurate VRAM reporting wired up, so we can turn on concurrency by default.
  - Daniel Hiltgen authored
    This adjusts our default settings to enable multiple models and parallel requests to a single model. Users can still override these with the same env var settings as before. Parallelism has a direct impact on num_ctx, which in turn can have a significant impact on small-VRAM GPUs, so this change also refines the algorithm: when parallel is not explicitly set by the user, we try to find a reasonable default that fits the model on their GPU(s). As before, multiple models will only load concurrently if they fully fit in VRAM.
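
The fitting heuristic could look something like the following: shrink the parallel count until the weights plus one context allocation per request fit in free VRAM. The sizes and names are hypothetical, and this is a sketch of the idea rather than the scheduler's real algorithm:

```go
package main

import "fmt"

// pickParallel sketches the idea: when the user has not set
// OLLAMA_NUM_PARALLEL, reduce parallelism until the model plus its
// per-request context fits in free VRAM. All sizes are hypothetical.
func pickParallel(freeVRAM, modelSize, perReqCtx uint64, want int) int {
	for n := want; n > 1; n-- {
		if modelSize+uint64(n)*perReqCtx <= freeVRAM {
			return n
		}
	}
	return 1
}

func main() {
	const GiB = 1 << 30
	fmt.Println(pickParallel(8*GiB, 5*GiB, 1*GiB, 4)) // 3
}
```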

  - Michael Yang authored
    fix: quantization with template
  - Michael Yang authored
  - Daniel Hiltgen authored
    Add the new tristate parsing logic to the code path for modelfiles, along with a unit test.
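
One common way to model a tristate like use_mmap in Go is a *bool, where nil means "not specified". A minimal sketch of such parsing, hypothetical rather than the actual modelfile parser:

```go
package main

import (
	"fmt"
	"strconv"
)

// parseUseMmap returns nil when the modelfile does not set use_mmap at all,
// keeping "unset" distinguishable from an explicit true or false.
func parseUseMmap(val string, present bool) (*bool, error) {
	if !present {
		return nil, nil // tristate: not specified
	}
	b, err := strconv.ParseBool(val)
	if err != nil {
		return nil, fmt.Errorf("use_mmap must be true or false, got %q", val)
	}
	return &b, nil
}

func main() {
	cases := []struct {
		val     string
		present bool
	}{{"", false}, {"true", true}, {"banana", true}}
	for _, c := range cases {
		switch b, err := parseUseMmap(c.val, c.present); {
		case err != nil:
			fmt.Println("error:", err)
		case b == nil:
			fmt.Println("use_mmap unset")
		default:
			fmt.Println("use_mmap =", *b)
		}
	}
}
```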

- 20 Jun, 2024 9 commits

  - Daniel Hiltgen authored
    Refine mmap default logic on Linux
  - Daniel Hiltgen authored
    Bump latest Fedora CUDA repo to 39
  - Daniel Hiltgen authored
    If we try to use mmap when the model is larger than the available system memory, loading is slower than the no-mmap approach.
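
The resulting default reduces to a simple comparison. A toy sketch, assuming the caller already knows the model size and free memory:

```go
package main

import "fmt"

// shouldMmap sketches the refined default: mmap only pays off when the whole
// model can stay resident. Once the model exceeds free memory, page faults
// make mmap-based loading slower than reading the file outright.
func shouldMmap(modelSize, freeMemory uint64) bool {
	return modelSize <= freeMemory
}

func main() {
	const GiB = 1 << 30
	fmt.Println(shouldMmap(4*GiB, 16*GiB))  // true: model fits, mmap it
	fmt.Println(shouldMmap(40*GiB, 16*GiB)) // false: fall back to no-mmap
}
```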

  - Michael Yang authored
    handle asymmetric embedding KVs
  - Josh authored
    fix: skip os.RemoveAll() if PID does not exist
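
A guard like that needs a way to ask whether a PID is still alive. On Unix the usual probe is sending signal 0; this sketch shows only that probe (the commit's actual cleanup rule lives in ollama's server code):

```go
package main

import (
	"fmt"
	"os"
	"syscall"
)

// pidExists reports whether a process is still alive on Unix: sending
// signal 0 performs the existence/permission check without delivering
// anything. (Windows needs a different probe; os.FindProcess alone is
// not enough on Unix, where it always succeeds.)
func pidExists(pid int) bool {
	proc, err := os.FindProcess(pid)
	if err != nil {
		return false
	}
	return proc.Signal(syscall.Signal(0)) == nil
}

func main() {
	fmt.Println(pidExists(os.Getpid())) // true: this process exists
	fmt.Println(pidExists(99999999))    // almost certainly false
}
```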

  - Michael Yang authored
  - Josh Yan authored
  - Josh Yan authored
  - Josh Yan authored

- 19 Jun, 2024 1 commit

  - royjhan authored
    * API Show Extended
    * Initial Draft of Information
    * Clean Up
    * Descriptive arg error messages and other fixes
    * Second Draft of Show with Projectors Included
    * Remove Chat Template
    * Touches
    * Prevent wrapping from files
    * Verbose functionality
    * Docs
    * Address Feedback
    * Lint
    * Resolve Conflicts
    * Function Name
    * Tests for api/show model info
    * Show Test File
    * Add Projector Test
    * Clean routes
    * Projector Check
    * Move Show Test
    * Touches
    * Doc update
    Co-authored-by: Patrick Devine <pdevine@sonic.net>